Abstract. The problem of determining the correct order of fluctuation of the optimal alignment
score of two random strings of length $n$ has been open for several decades. It is known [12]
that a biased expected effect of a random letter-change on the optimal score implies an order of
fluctuation linear in $\sqrt{n}$. However, in many situations where such a biased effect is
observed empirically, it has been impossible to prove analytically. The main result of this paper
shows that when the rescaled limit of the optimal alignment score increases in a certain
direction, then the biased effect exists. On the basis of this result one can quantify a confidence
level for the existence of such a biased effect, and hence of an order $\sqrt{n}$ fluctuation, based on
simulation of optimal alignment scores. This is an important step forward, as the correct order
of fluctuation was previously known only for certain special distributions [12], [13], [5], [10]. To
illustrate the usefulness of our new methodology, we apply it to optimal alignments of strings
written in the DNA alphabet. As scoring function, we use the BLASTZ default substitution
matrix together with a realistic gap penalty. BLASTZ is one of the most widely used sequence
alignment methodologies in bioinformatics. For this DNA setting, we show that, with a high
level of confidence, the fluctuation of the optimal alignment score is of order $\Theta(\sqrt{n})$. An
important special case of the optimal alignment score is the length of the Longest Common
Subsequence (LCS) of random strings. For binary sequences with equiprobable symbols the question
of the fluctuation of the LCS remains open. The symmetry in that case does not allow for our
method. On the other hand, in real-life DNA sequences, it is not the case that all letters occur
with the same frequency. So, for many real-life situations, our method allows one to determine the
order of the fluctuation up to a high confidence level.
S. Amsalu, R.A. Hauser, and H.F. Matzinger
1. Introduction 



Let $x = x_1 x_2 \ldots x_n$ and $y = y_1 y_2 \ldots y_n$ be two finite strings written with symbols from a
finite alphabet $\mathcal{A}$. An alignment with gaps $\pi$ of $x$ and $y$ is a strictly increasing integer sequence
contained in $[1,n] \times [1,n]$. Thus,

$$\pi = ((\mu_1,\nu_1), (\mu_2,\nu_2), \ldots, (\mu_k,\nu_k))$$

where $1 \le \mu_1 < \mu_2 < \cdots < \mu_k \le n$ and $1 \le \nu_1 < \nu_2 < \cdots < \nu_k \le n$. The alignment $\pi$ aligns
the symbol $x_{\mu_i}$ with $y_{\nu_i}$ for $i = 1, 2, \ldots, k$. The symbols in the strings $x$ and $y$ that are not
aligned with a letter are said to be aligned with a gap. We will use the symbol $g$ to denote
gaps and write $\mathcal{A}^* = \mathcal{A} \cup \{g\}$ for the augmented alphabet. A scoring function is a map $S$ from
$\mathcal{A}^* \times \mathcal{A}^*$ to the set of real numbers. In everything that follows, we take $S$ to be symmetric so
that $S(c,d) = S(d,c)$ for all $c,d \in \mathcal{A}^*$. The alignment score according to $S$ under an alignment
$\pi$ of two strings $x$ and $y$ is defined as

$$S_\pi(x,y) := \sum_{i=1}^{k} S(x_{\mu_i}, y_{\nu_i}) + \sum_{j \notin \mu} S(x_j, g) + \sum_{j \notin \nu} S(g, y_j),$$

where $\mu := \{\mu_1, \ldots, \mu_k\}$ and $\nu := \{\nu_1, \ldots, \nu_k\}$.

An optimal alignment of two strings x and y is an alignment with gaps that maximizes the 
alignment score for a given scoring function. Note that the set of optimal alignments depends 
thus not only on x and y, but also on the scoring function S. 

As an example of an alignment with gaps, let us assume that one species' DNA contains the string x =
AGTTCG and another's the string y = AATTAC, where x and y are thought of as potentially related. Consider
the alignment $\pi$ given by the following diagram,

    x | A  G  T  T  g  C  G
    y | A  A  T  T  A  C  g

The alphabet we consider in this example is $\mathcal{A}$ = {A, T, C, G}, and $\mathcal{A}^* = \mathcal{A} \cup \{g\}$ is the augmented alphabet.
The alignment score under $\pi$ of x and y is given by

$$S_\pi(x, y) := S(A, A) + S(G, A) + S(T, T) + S(T, T) + S(g, A) + S(C, C) + S(G, g).$$

In this example $\pi$ is an optimal alignment when $S$ assigns a score of 1 to identical letters and a score of $-1$ to
two different letters aligned to one another or to a letter aligned with a gap.
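The optimality claim in this example can be checked numerically. The following is a minimal sketch, assuming the standard dynamic-programming recursion for global alignment (the paper does not prescribe an algorithm); the names `optimal_alignment_score` and `unit_score` are hypothetical and chosen only for illustration.

```python
def optimal_alignment_score(x, y, score, gap="g"):
    """Maximal alignment score of x and y over all alignments with gaps."""
    n, m = len(x), len(y)
    # best[i][j] = optimal score for aligning the prefixes x[:i] and y[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] + score(x[i - 1], gap)
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] + score(gap, y[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score(x[i - 1], y[j - 1]),  # letter with letter
                best[i - 1][j] + score(x[i - 1], gap),           # letter of x with a gap
                best[i][j - 1] + score(gap, y[j - 1]),           # letter of y with a gap
            )
    return best[n][m]


def unit_score(c, d):
    """+1 for identical letters, -1 for a mismatch or a letter against a gap."""
    return 1 if c == d else -1


print(optimal_alignment_score("AGTTCG", "AATTAC", unit_score))  # prints 1
```

The value 1 agrees with the score of the alignment $\pi$ shown above (four matches, one mismatch, two gaps), so under this scoring function no alignment does better than $\pi$.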

Alignment scores are widely used in bioinformatics and natural language processing. In
computational genetics, gaps are interpreted as letters that disappeared in the course of evolution.
The historical alignment of two DNA sequences is the alignment with gaps that aligns letters
that evolved from the same letter in the common ancestral DNA. This alignment is unknown,
but if it were available, it would yield information about how closely related two biological
species are, how long ago their genomes started to diverge, and what the phylogenetic tree of
a chosen set of species looks like. An important task in bioinformatics is therefore to estimate
which alignment is most likely to be the "historical alignment".

When the scoring function is the log-likelihood that two letters evolved from a common ances- 
tral letter, alignments with maximal alignment score are also the most likely historic alignments, 
assuming that letters mutate or get deleted independently of their neighbors. This observation 
is the basis for using optimal alignment scores to test whether two sequences are related or 
not. Unrelated sequences should be stochastically independent, and this should be reflected by 
a lower optimal alignment score. To understand how powerful such a relatedness test is, one 
needs to understand the size of the fluctuation of the optimal alignment score, but the fluctu- 
ation depends of course on the stochastic model used for unrelated DNA sequences and on the 
scoring function. 

In this paper we consider two finite random strings $X = X_1 X_2 \ldots X_n$ and $Y = Y_1 Y_2 \ldots Y_n$ of
length $n$ in which all letters $X_i$ ($i = 1, \ldots, n$) and $Y_j$ ($j = 1, \ldots, n$) are i.i.d. random variables
that take values in a given finite alphabet $\mathcal{A}$. For any letter $a \in \mathcal{A}$, let $p_a$ denote the probability

$$p_a = P(X_i = a) = P(Y_j = a).$$

Let $S : \mathcal{A}^* \times \mathcal{A}^* \to \mathbb{R}$ be a scoring function. We denote the optimal alignment score of
$X = X_1 \ldots X_n$ and $Y = Y_1 \ldots Y_n$ according to $S$ by

$$L_n(S) := \max_\pi S_\pi(X,Y) = \max_\pi S_\pi(X_1 X_2 \ldots X_n, Y_1 Y_2 \ldots Y_n),$$

where the maximum is taken over all the alignments $\pi$ with gaps aligning $X$ and $Y$. Let $\lambda_n(S)$
denote the rescaled expected optimal alignment score

$$\lambda_n(S) := \frac{E[L_n(S)]}{n}.$$

A simple subadditivity argument [6] shows that $\lambda_n(S)$ converges as $n$ goes to infinity. We denote
this limit by $\lambda(S)$, and hence

$$\lambda(S) := \lim_{n\to\infty} \lambda_n(S) = \lim_{n\to\infty} \frac{E[L_n(S)]}{n}.$$

The rate of convergence of the last limit above was bounded by Alexander [3], [2]. We also
give our own bound in the Appendix.
One of the important questions concerning optimal alignments is the asymptotic order of
fluctuation when $n$ goes to infinity. Although McDiarmid's inequality implies that $\mathrm{VAR}[L_n]$ is
at most of order $O(n)$, it has been a long-standing open problem whether this upper
bound is tight up to a multiplicative constant, in other words, whether

(1.1) $\mathrm{VAR}[L_n(S)] = \Theta(n)$

holds. Steele [16] has proven that for the Longest Common Subsequence case (which is a special
case of optimal alignment with the scoring function being the identity matrix and a zero gap
penalty), one has $\mathrm{VAR}[L_n(S)] \le n$. Several conflicting conjectures have
been proposed about this problem: while Waterman conjectured [17] that the order is indeed
given by (1.1), Chvátal and Sankoff conjectured a different order [6] which would be more in
line with corresponding results on Last Passage Percolation (LPP) models, where there exist
several situations [?], [1] in which it is known that the order of fluctuation is the third root of the
order of the expectation.

Optimal alignment scores can be reformulated as an LPP problem with correlated weights. We
find it interesting and surprising that the order we obtain is totally different from the order found in
other LPP models.

In several special cases [12], [13], [10], the order (1.1) has been proven analytically. In each case
the proof was based on the technique of reducing the fluctuation problem to the biased effect
of a random change in the sequences: in [5] and [12] it was established that if changing one
letter at random has a positive biased effect on $L_n(S)$, then the order (1.1) must hold. More
specifically, for two given letters $a, b \in \mathcal{A}$, let $(\tilde X, \tilde Y)$ denote the sequence-pair obtained from
$(X, Y)$ by changing exactly one entry, chosen uniformly at random among all the letters $a$ that
appear in $X$ and $Y$, into a $b$.

Take for example x = aababc and y = abbbbb. Then there are a total of 4 a's when we count all the a's in
both sequences together. Each of these a's thus has a probability of 1/4 of being chosen and replaced by a b. Since
only one a is changed in the two strings x and y, after our letter change one of the strings remains
identical and the other is changed by one letter. Let us denote by $\tilde x$ and $\tilde y$ the sequences after the change. In
this example, the a in y has a probability of 1/4 of being chosen. If it gets chosen, y is transformed into bbbbbb. So
we have $P(\tilde y = bbbbbb, \tilde x = x) = 1/4$. There are 3 a's in x, so the probability that x gets changed is 3/4. Hence,

$$P(\tilde x \in \{bababc, abbabc, aabbbc\}, \tilde y = y) = 3/4.$$
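The probabilities in this example can be reproduced by enumerating every possible single change. This is a small illustrative sketch; the function name `random_change_distribution` is hypothetical and not taken from the paper.

```python
from collections import Counter
from fractions import Fraction


def random_change_distribution(x, y, a="a", b="b"):
    """Exact law of the pair obtained by turning one uniformly chosen a into b."""
    # Every occurrence of a, recorded as (string index, position in that string).
    positions = [(0, i) for i, ch in enumerate(x) if ch == a]
    positions += [(1, j) for j, ch in enumerate(y) if ch == a]
    law = Counter()
    for which, pos in positions:
        strings = [x, y]
        s = strings[which]
        strings[which] = s[:pos] + b + s[pos + 1:]
        law[tuple(strings)] += Fraction(1, len(positions))
    return law


law = random_change_distribution("aababc", "abbbbb")
for (x_new, y_new), prob in sorted(law.items()):
    print(x_new, y_new, prob)
```

Each of the four outcomes has probability 1/4: three of them change x and leave y untouched, and one changes y into bbbbbb.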
Let us denote the optimal alignment score of $\tilde X$ and $\tilde Y$ by

$$\tilde L_n(S) = \max_\pi S_\pi(\tilde X, \tilde Y).$$

In [12] it was shown that if there is a constant $c > 0$, not depending on $n$, such that

$$E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge c$$

holds with high probability, then the order (1.1) follows. In this context "high probability" is
defined as a probability $1 - O(n^{-\alpha \ln n})$, for some constant $\alpha > 0$ that does not depend on $n$. An
alternative proof of this result is given in Lemma 2.1 in the next section.

One of the shortcomings of the above-cited papers [12], [5] is that a strong asymmetry is
required in the distribution on $\mathcal{A}$ used to generate the random strings $X, Y$ for it to be
possible to prove the existence of a biased effect of random letter changes. In many situations of
relevance to applications, the biased effect is visible in simulations but cannot be established
analytically using the techniques from [12]. The present paper addresses this problem: Theorem
2.1 establishes that as soon as

(1.2) $\lambda(S) - \lambda(S - \epsilon T) > 0$

for some $\epsilon > 0$, the biased effect of a random letter change exists, and this in turn implies the
fluctuation order (1.1). In this context, let $a$ and $b$ be fixed elements of $\mathcal{A}$, and let $T : \mathcal{A}^* \times \mathcal{A}^* \to \mathbb{R}$
be the scoring function given by $T(a, c) = T(c, a) := S(b, c) - S(a, c)$ for all $c \in \mathcal{A}^*$ with $c \ne a$,
and $T(d, c) = 0$ when $d, c \ne a$. Furthermore, let $T(a, a) := 2(S(b, a) - S(a, a))$.

The practical importance of this result is that the validity of Condition (1.2) can be verified
by Monte Carlo simulation up to any desired confidence level, and this in turn yields a test on
whether the order (1.1) holds, at the same confidence level. A practical example of such a test
is given in Section 5.
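To make the Monte Carlo idea concrete, here is a rough sketch of how the finite-$n$ quantity $\lambda_n(S) - \lambda_n(S - \epsilon T)$ might be estimated by simulation. It is not the procedure of Section 5: the function names, the choice to score both functions on the same simulated string pairs, and the toy scoring function are all assumptions made for illustration. Passing from $\lambda_n$ to the limit $\lambda$ additionally requires a rate-of-convergence bound, which this sketch does not attempt.

```python
import random

GAP = "-"


def optimal_score(x, y, S):
    """Optimal alignment score L_n(S) by dynamic programming; S is a dict on symbol pairs."""
    prev = [0.0]
    for d in y:
        prev.append(prev[-1] + S[(GAP, d)])
    for c in x:
        cur = [prev[0] + S[(c, GAP)]]
        for j, d in enumerate(y, start=1):
            cur.append(max(prev[j - 1] + S[(c, d)],     # align c with d
                           prev[j] + S[(c, GAP)],       # align c with a gap
                           cur[j - 1] + S[(GAP, d)]))   # align d with a gap
        prev = cur
    return prev[-1]


def make_T(S, symbols, a, b):
    """Scoring function T built from S for the change a -> b, as defined in the text."""
    T = {(c, d): 0.0 for c in symbols for d in symbols}
    for c in symbols:
        if c != a:
            T[(a, c)] = T[(c, a)] = S[(b, c)] - S[(a, c)]
    T[(a, a)] = 2 * (S[(b, a)] - S[(a, a)])
    return T


def estimate_gap(S, probs, a, b, eps, n, runs, seed=0):
    """Sample mean and standard error of (L_n(S) - L_n(S - eps*T)) / n."""
    rng = random.Random(seed)
    letters = list(probs)
    weights = [probs[c] for c in letters]
    T = make_T(S, letters + [GAP], a, b)
    S_eps = {key: S[key] - eps * T[key] for key in S}
    diffs = []
    for _ in range(runs):
        x = rng.choices(letters, weights, k=n)
        y = rng.choices(letters, weights, k=n)
        diffs.append((optimal_score(x, y, S) - optimal_score(x, y, S_eps)) / n)
    mean = sum(diffs) / runs
    var = sum((d - mean) ** 2 for d in diffs) / (runs - 1)
    return mean, (var / runs) ** 0.5


# Toy setting (an assumption, not the BLASTZ matrix): +1 match, -1 mismatch or gap.
symbols = ["a", "b", GAP]
S = {(c, d): (1.0 if c == d else -1.0) for c in symbols for d in symbols}
mean, stderr = estimate_gap(S, {"a": 0.3, "b": 0.7}, "a", "b", eps=0.1, n=100, runs=40)
print(f"estimated gap {mean:.4f}, standard error {stderr:.4f}")
```

An estimate that lies several standard errors above zero is the kind of evidence that, combined with a convergence bound, could support Condition (1.2) at a chosen confidence level.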

2. Details of the results 

We will consider strings of length n written with letters from a finite alphabet A. 

Consider, for example, the following two strings x = babbababbba and y = bbbbabbbabb. We consider alignments
with gaps of two sequences. This means that each letter is aligned either with a letter or with a gap. Let us see an example
of an alignment with gaps $\pi$ of x with y:

    (2.1)    x | b  a  b  b  a  b  a  b  G  b  b  a
             y | b  b  b  b  a  b  b  b  a  b  b  G

Alignments with gaps are used to compare similar sequences. For this purpose one uses a scoring function $S$ from
$\mathcal{A}^* \times \mathcal{A}^*$ to the real numbers. Here $\mathcal{A}^*$ represents the alphabet $\mathcal{A}$ augmented by a symbol G representing the gap. The scoring function
should measure how close letters are. The total score of an alignment is denoted by $S_\pi(x,y)$. It is the sum of the
scores of the aligned symbol pairs. In the present example, the alignment score for the alignment $\pi$ is equal to:

$$S_\pi(x, y) := S(b, b) + S(a, b) + 2S(b, b) + S(a, a) + S(b, b) + S(a, b) + S(b, b) + S(G, a) + 2S(b, b) + S(a, G).$$

An alignment which maximizes the alignment score for given strings x and y is called an optimal alignment. Of
course, which alignment is optimal depends on the scoring function we use. We will count the number of aligned
symbol pairs appearing in an alignment of x with y. In the example of alignment $\pi$ presently under consideration
(see (2.1)) we have 7 times b aligned with itself. We denote the number of times we see a b aligned with a b
by $Q_\pi(b,b)$. Hence in our example $Q_\pi(b,b) = 7$. In general, for any two letters $c, d$ from $\mathcal{A}^*$, let $Q_\pi(c,d)$ be
the total number of columns where a $c$ from x gets aligned with a $d$ from y. Now, clearly, we can write the total
alignment score in terms of the values $Q_\pi(c, d)$:

(2.2) $S_\pi(x,y) = \sum_{c,d \in \mathcal{A}^*} S(c,d) \cdot Q_\pi(c,d)$

We are next going to consider the effect of changing a randomly chosen a in x or y into a b. Among all the a's in
x and y we choose exactly one with equal probability, so that the chosen letter will be either in x or in y. Let $\tilde x$
and $\tilde y$ denote the sequences x and y after our random letter change. Note that either $\tilde x = x$ or $\tilde y = y$, as only one
letter changes. We want to calculate the expected change:

$$E[S_\pi(\tilde x, \tilde y) - S_\pi(x,y)]$$

We find the following formula

(2.3) $E[S_\pi(\tilde x, \tilde y) - S_\pi(x,y)] = \frac{1}{n_a} \sum_{c \in \mathcal{A}^*} \Big( Q_\pi(a,c)\,(S(b,c) - S(a,c)) + Q_\pi(c,a)\,(S(c,b) - S(c,a)) \Big)$

where $n_a$ denotes the total number of a's in both strings x and y counted together.

To understand formula (2.3), consider the example of an alignment $\pi$ given in (2.1) and let us calculate the
expected change in alignment score due to our random change. In x there are two a's which are aligned with a b.
When any one of them gets chosen and transformed into b, the change in alignment score is $S(b, b) - S(a, b)$.
This event has probability $2/n_a$. Hence, for our conditional expectation, this adds a term $(S(b,b) - S(a,b)) \cdot 2/n_a$.
There is also one letter a in x aligned with a, the change of which to a b results in a change in score of
$S(b,a) - S(a,a)$. The probability is $1/n_a$, so this contributes $(S(b,a) - S(a, a))/n_a$ to the expected change in
score. Finally, there is one a in y which is aligned with an a. If this a gets changed to a b, then the change is
$S(a, b) - S(a, a)$, which happens with probability $1/n_a$. The contribution to the expected change from this letter
is thus $(S(a,b) - S(a,a)) \cdot (1/n_a)$. Now let us assume that the alignment of a gap gives the same value whether
it is aligned with a or b. When we choose an a aligned with a gap for our random letter change, the score remains
the same. The contribution of the a's aligned with gaps to the expected change is thus 0 in this case. Summing
up the above contributions, the expected change of the alignment score in our example is equal to

$$\frac{2(S(b, b) - S(a, b)) + 1 \cdot (S(b, a) - S(a, a)) + 1 \cdot (S(a, b) - S(a, a))}{n_a}$$

where $n_a = 6$. Compare the above formula to (2.3).
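The bookkeeping in this example can be checked mechanically. The sketch below reads the columns of the alignment off the score formula given earlier in this section, counts the pairs $Q_\pi(c,d)$, and compares formula (2.3) with a direct column-by-column enumeration. The particular score values are hypothetical (chosen so that the result is nonzero) and are not taken from the paper.

```python
from collections import Counter
from fractions import Fraction

GAP = "G"
# Columns of the example alignment, read off from the score formula in the text.
top = "babbababGbba"  # x with its gap
bot = "bbbbabbbabbG"  # y with its gap
columns = list(zip(top, bot))
Q = Counter(columns)  # Q[(c, d)] = number of columns with c (from x) above d (from y)
n_a = top.count("a") + bot.count("a")


def S(c, d):
    """Hypothetical symmetric score: S(a,a)=1, S(b,b)=3, mismatch -1, gap -2."""
    if GAP in (c, d):
        return -2
    if c != d:
        return -1
    return 3 if c == "b" else 1


def expected_change_formula():
    """Right-hand side of (2.3)."""
    total = sum(Q[("a", c)] * (S("b", c) - S("a", c))
                + Q[(c, "a")] * (S(c, "b") - S(c, "a"))
                for c in ["a", "b", GAP])
    return Fraction(total, n_a)


def expected_change_direct():
    """Average score change over all n_a equally likely single changes a -> b."""
    total = 0
    for c, d in columns:
        if c == "a":  # the chosen a sits in x
            total += S("b", d) - S("a", d)
        if d == "a":  # the chosen a sits in y
            total += S(c, "b") - S(c, "a")
    return Fraction(total, n_a)


print(Q[("b", "b")], n_a, expected_change_formula(), expected_change_direct())
```

With these illustrative scores both computations give 2/3, and the counts $Q_\pi(b,b) = 7$ and $n_a = 6$ match the values stated in the text.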
Let us now define the functional $T$:

$$T : \mathcal{A}^* \times \mathcal{A}^* \to \mathbb{R}$$

where for all $c \in \mathcal{A}^*$ with $c \ne a$, we have

$$T(a,c) := S(b,c) - S(a,c)$$

and

$$T(c,a) := S(c,b) - S(c,a)$$

and $T(c,d) := 0$ if $d, c \ne a$. Furthermore, $T(a,a) = 2(S(b, a) - S(a,a))$.

Note that since $S$ is symmetric, $T$ is symmetric as well.
The expected effect of our random change of letters corresponds to the "alignment score according to $T$" rescaled
by the total number of a's in x and y. So equation (2.3), written using $T$, becomes

(2.4) $E[S_\pi(\tilde x, \tilde y) - S_\pi(x,y)] = \frac{1}{n_a} \sum_{c,d \in \mathcal{A}^*} T(c,d) \cdot Q_\pi(c,d) = \frac{T_\pi(x,y)}{n_a}$

where $T_\pi(x,y)$ denotes the score of the alignment $\pi$ aligning x with y and using as scoring function $T$ instead of
$S$.
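As a sanity check of (2.4), one can build $T$ from a concrete $S$ and score the example alignment with it. The sketch below uses the same alignment columns as the example of this section; the score values and the helper names `make_T` and `alignment_score` are hypothetical.

```python
from fractions import Fraction

GAP = "G"
SYMBOLS = ["a", "b", GAP]


def make_T(S, a="a", b="b"):
    """Build T from S for the change a -> b, following the definition in the text."""
    T = {(c, d): 0 for c in SYMBOLS for d in SYMBOLS}
    for c in SYMBOLS:
        if c != a:
            T[(a, c)] = S[(b, c)] - S[(a, c)]
            T[(c, a)] = S[(c, b)] - S[(c, a)]
    T[(a, a)] = 2 * (S[(b, a)] - S[(a, a)])
    return T


def alignment_score(score, columns):
    """Score of a fixed alignment: the sum of the scores of its columns."""
    return sum(score[col] for col in columns)


# Hypothetical symmetric S: S(a,a)=1, S(b,b)=3, mismatch -1, anything against a gap -2.
S = {(c, d): -1 for c in SYMBOLS for d in SYMBOLS}
S[("a", "a")] = 1
S[("b", "b")] = 3
for c in SYMBOLS:
    S[(c, GAP)] = S[(GAP, c)] = -2

columns = list(zip("babbababGbba", "bbbbabbbabbG"))  # the example alignment
n_a = sum(col.count("a") for col in columns)

T = make_T(S)
print(Fraction(alignment_score(T, columns), n_a))  # T_pi(x, y) / n_a
```

For these illustrative scores the result is 2/3, which is the value that formula (2.3) gives for the same alignment and the same $S$.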

Let us next present a theorem which shows that when $\lambda(S) - \lambda(S - \epsilon T) > 0$, a random
change of an a into a b typically has a positive biased effect on the optimal alignment score
$L_n(S)$:

Theorem 2.1. Let $\mathcal{A}$ be a finite alphabet and $S : \mathcal{A}^* \times \mathcal{A}^* \to \mathbb{R}$ a scoring function. Let the
function $T : \mathcal{A}^* \times \mathcal{A}^* \to \mathbb{R}$ be defined as above for two given letters $a, b$ from $\mathcal{A}$. If there exists
$\epsilon > 0$ such that $\lambda(S) - \lambda(S - \epsilon T) > 0$, then for any given constant $\delta > 0$ there exists $\alpha > 0$ so
that the following holds true for all $n$ large enough,

(2.5) $P\left( E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge \frac{\lambda(S) - \lambda(S - \epsilon T)}{\epsilon \cdot p_a} - \delta \right) \ge 1 - n^{-\alpha \ln(n)},$

where $p_a = P(X_i = a) = P(Y_i = a)$.

In several instances [12], [11], it was proven that when a random change on the strings has
a positive biased expected effect on the score, then the fluctuation order $\mathrm{VAR}[L_n(S)] = \Theta(n)$
applies. For the special framework of the current paper, we prove this fact in Lemma 2.1 below.
Together with Theorem 2.1 this result implies that if there exists $\epsilon > 0$ so that $\lambda(S) - \lambda(S - \epsilon T) > 0$,
then the fluctuation order (1.1) holds, see Theorem 2.2.

Section 3 is dedicated to proving Theorem 2.1. The main idea behind the proof is quite
straightforward and will be briefly explained here. Let $X = X_1 \ldots X_n$ and $Y = Y_1 \ldots Y_n$ as
before. From Equation (2.4) in the example above it follows that for any optimal alignment $\pi$
of $X$ and $Y$ according to $S$, we have

(2.6) $E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge \frac{T_\pi(X,Y)}{N_a}.$

Indeed, $\tilde L_n(S) \ge S_\pi(\tilde X, \tilde Y)$ because $\pi$ is also an alignment of the changed strings, while
$L_n(S) = S_\pi(X,Y)$ because $\pi$ is optimal for $S$.
Here $T_\pi(X,Y)$ denotes the alignment score of $\pi$ aligning $X$ and $Y$ according to the scoring
function $T$. Furthermore, $N_a$ denotes the total number of a's in $X$ and in $Y$ combined. By
linearity of the alignment score in the scoring function, we find that

(2.7) $\epsilon \cdot T_\pi(X, Y) = S_\pi(X, Y) - (S - \epsilon T)_\pi(X, Y).$

Here $(S - \epsilon T)_\pi(X,Y)$ denotes the score of the alignment $\pi$ aligning $X = X_1 \ldots X_n$ and
$Y = Y_1 \ldots Y_n$ when we use the scoring function $(S - \epsilon T)$ instead of $S$. The alignment $\pi$ is optimal
for $S$ but not necessarily for $(S - \epsilon T)$. Hence $S_\pi(X,Y)$ is equal to the optimal alignment score
$L_n(S)$, but $(S - \epsilon T)_\pi(X,Y)$ is less than or equal to $L_n(S - \epsilon T)$. This implies that

(2.8) $S_\pi(X, Y) - (S - \epsilon T)_\pi(X, Y) \ge L_n(S) - L_n(S - \epsilon T).$

Combining (2.6), (2.7) and (2.8), we obtain

(2.9) $E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge \frac{L_n(S) - L_n(S - \epsilon T)}{n} \cdot \frac{n}{\epsilon N_a}.$

Note that the right side of the last inequality above converges in probability to

$$\frac{\lambda(S) - \lambda(S - \epsilon T)}{\epsilon \cdot p_a}$$

where $p_a$ is the probability of letter a. This already implies that the probability on the left side
of inequality (2.5) in Theorem 2.1 goes to 1 as $n \to \infty$. The rate as in inequality (2.5) can then
easily be obtained from the Azuma-Hoeffding Theorem 6.3 given below. Again, the details of
this proof are given in the next section. Next, let us formulate the lemma which shows
that a biased effect of our random letter change implies the desired order of the fluctuation.
We give the proof because, unlike in [12], we also consider the case where we have more than 2
letters in the alphabet.

Lemma 2.1. Assume that there exist constants $\Delta > 0$ and $\alpha > 0$ such that for all $n$ large
enough it is true that

(2.10) $P\left( E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge \Delta \right) \ge 1 - n^{-\alpha \ln n}.$

Then we have $\mathrm{VAR}[L_n(S)] = \Theta(n)$.
Proof. Let $N_b$ denote the total number of symbols b in the strings $X = X_1 X_2 \ldots X_n$ and
$Y = Y_1 Y_2 \ldots Y_n$ combined. (This means that we take the number of b's in $X$ and the number of b's
in $Y$ and add them together to get $N_b$.) Note that $N_b$ has a binomial distribution with

$$E[N_b] = 2 p_b \cdot n, \qquad \mathrm{VAR}[N_b] = 2 p_b (1 - p_b) n,$$

where $p_b := P(X_i = b) = P(Y_i = b)$.

Let $N_{ab}$ denote the total number of symbols a and b in the strings $X = X_1 X_2 \ldots X_n$ and
$Y = Y_1 Y_2 \ldots Y_n$ combined. Note that $N_{ab}$ has a binomial distribution with

$$E[N_{ab}] = 2(p_a + p_b) \cdot n,$$

where $p_a := P(X_i = a) = P(Y_i = a)$.

Next we are going to define a collection of random string-pairs $(X(k,l), Y(k,l))$ for every
$l \le 2n$ and $k \le l$. The string-pair $(X(k,l), Y(k,l))$ has its distribution equal to that of the string-pair

$$(X,Y) = (X_1 X_2 \ldots X_n, Y_1 Y_2 \ldots Y_n)$$

conditional on $N_b = k$, $N_{ab} = l$. Hence,

$$\mathcal{L}(X(k,l), Y(k,l)) = \mathcal{L}(X, Y \mid N_b = k, N_{ab} = l).$$

For given $l \le 2n$, we define $(X(k,l), Y(k,l))$ by induction on $k$. For this, let $(X(0,l), Y(0,l))$
denote a string-pair of length $n$ which is independent of $N_b$ and of $N_{ab}$. We also require that
$(X(0,l), Y(0,l))$ has its distribution equal to that of $(X, Y)$ conditional on $N_b = 0$ and $N_{ab} = l$. Then
we choose one a at random in $(X(0,l), Y(0,l))$, that is, among all a's in $X(0,l)$ and in $Y(0,l)$ with
equal probability, and change it into a b. This yields the string-pair
$(X(1,l), Y(1,l))$. Once $(X(k,l), Y(k,l))$ is obtained, we choose an a at random in $(X(k,l), Y(k,l))$
and change it into a b. This then gives the string-pair $(X(k+1,l), Y(k+1,l))$. We go on until
$k = l$. We do this construction by induction on $k$ for every $l = 1, 2, \ldots, 2n$.

Now, due to invariance under permutation, we can see that with this definition we indeed
obtain that

$$(X(k,l), Y(k,l))$$

has the distribution of $(X, Y)$ given $N_b = k$, $N_{ab} = l$. Hence $(X(N_b, N_{ab}), Y(N_b, N_{ab}))$ has the
same distribution as $(X,Y)$. So the optimal alignment score of $X(N_b, N_{ab})$ and $Y(N_b, N_{ab})$ has the
same distribution as the optimal alignment score of $X$ and $Y$. Hence we also have the same
variance:

(2.11) $\mathrm{VAR}[f(N_b, N_{ab})] = \mathrm{VAR}[L_n(S)]$

where $f(N_b, N_{ab})$ denotes the optimal alignment score of $X(N_b, N_{ab})$ and $Y(N_b, N_{ab})$. (In other
words, $f(k,l)$ is defined to be the optimal alignment score of $X(k,l)$ and $Y(k,l)$.) Conditioning
can only reduce the variance, and hence:

(2.12) $\mathrm{VAR}[f(N_b, N_{ab})] \ge E\big[\, \mathrm{VAR}[f(N_b, N_{ab}) \mid f, N_{ab}] \,\big].$

Note that for any random variable $W$, the variance of $W$ is half the variance of $W - W^*$,
where $W^*$ designates an independent copy of $W$. So we have

$$\mathrm{VAR}[W] = 0.5 \cdot E[(W - W^*)^2].$$
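For completeness, this identity follows in one line from independence. The derivation below assumes only that $W$ has a finite second moment and that $W^*$ is an independent copy of $W$, so that $E[W W^*] = E[W]\,E[W^*] = E[W]^2$:

```latex
\begin{aligned}
E\big[(W - W^*)^2\big]
  &= E[W^2] - 2\,E[W W^*] + E[(W^*)^2] \\
  &= 2\,E[W^2] - 2\,E[W]^2 \\
  &= 2\,\mathrm{VAR}[W].
\end{aligned}
```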

Let us apply this idea to (2.12). For this, let $N_b^*$ be a variable which, conditional on $N_{ab}$, is
independent of $N_b$ and has the same distribution as $N_b$. Hence we request that for every $i \le 2n$ we
have

$$\mathcal{L}(N_b^*, N_b \mid N_{ab} = i) = \mathcal{L}(N_b \mid N_{ab} = i) \otimes \mathcal{L}(N_b^* \mid N_{ab} = i)$$

and

$$\mathcal{L}(N_b \mid N_{ab} = i) = \mathcal{L}(N_b^* \mid N_{ab} = i).$$

We also assume that $N_b^*$ is independent of $f(\cdot,\cdot)$.

Then we have that

(2.13) $\mathrm{VAR}[f(N_b, N_{ab}) \mid f, N_{ab}] = 0.5 \cdot E\big[ (f(N_b, N_{ab}) - f(N_b^*, N_{ab}))^2 \mid f, N_{ab} \big].$

Let now $c_2 > c_1 > 0$ be two constants not depending on $n$. We will see later how we have to
select these constants. Let $I_n$ be the integer interval

$$I_n := \big[ E[N_b] - c_2 \sqrt{n},\; E[N_b] + c_2 \sqrt{n} \big].$$

Let $G_n^{I}$ be the event that $N_b$ and $N_b^*$ are both in the interval $I_n$.
Let $G_n^{II}$ be the event that

$$|N_b - N_b^*| \ge c_1 \sqrt{n}.$$

Let $G_n$ be the event

$$G_n := G_n^{I} \cap G_n^{II}.$$

Let $J_n$ denote the integer interval

$$J_n := \big[ E[N_{ab}] - \sqrt{n},\; E[N_{ab}] + \sqrt{n} \big].$$

Let $K_n$ be the event that $N_{ab}$ lies within the interval $J_n$.

Let $H_n$ be the event that for any $l \in J_n$ the following holds: for any integers $x < y$ in the interval $I_n$
which are at least $c_1 \sqrt{n}$ apart, the average slope of $f(\cdot, l)$ between $x$ and $y$ is greater than or equal
to $\Delta/2$, that is,

$$\frac{f(y,l) - f(x,l)}{y - x} \ge \frac{\Delta}{2}.$$

Now, clearly, when the events $G_n$, $H_n$ and $K_n$ all hold, we have

$$|f(N_b, N_{ab}) - f(N_b^*, N_{ab})|^2 \ge 0.25\, c_1^2 \Delta^2 \cdot n.$$

This implies that

(2.14) $E\big[\, E[ (f(N_b, N_{ab}) - f(N_b^*, N_{ab}))^2 \mid f, N_{ab} ] \,\big] \ge P(G_n \cap H_n \cap K_n) \cdot 0.25\, c_1^2 \Delta^2 \cdot n.$

We can now combine (2.11), (2.12), (2.13) and (2.14), where the factor 0.5 comes from (2.13), to obtain

(2.15) $\mathrm{VAR}[L_n(S)] \ge P(G_n \cap H_n \cap K_n) \cdot 0.125\, c_1^2 \Delta^2 \cdot n$

and hence

(2.16) $\mathrm{VAR}[L_n(S)] \ge \big(1 - P(G_n^c) - P(H_n^c) - P(K_n^c)\big) \cdot 0.125\, c_1^2 \Delta^2 \cdot n.$

By the Central Limit Theorem, when taking $c_2$ large enough (but not depending on $n$), the
limit $\lim_{n\to\infty} P(G_n^{I})$ gets as close to 1 as we want. Similarly, by Lemma 2.2,
taking $c_1 > 0$ small enough (but not depending on $n$), the limit $\lim_{n\to\infty} P(G_n^{II})$ also gets
as close to 1 as we want. Hence, taking $c_1 > 0$ small enough and $c_2 > 0$ large enough, we
get the limit for $n \to \infty$ of $P(G_n^c)$ as close to 0 as we want. By Lemma 2.3 we know that
$P(H_n^c)$ goes to 0 as $n \to \infty$. Finally, by the Central Limit Theorem, the probability $P(K_n^c)$
converges to a number bounded away from 1 as $n \to \infty$. Applying all of this to inequality (2.16),
we find that for $c_1 > 0$ small enough and $c_2 > 0$ large enough (both not depending on $n$),
there exists a constant $c > 0$ not depending on $n$ so that for all $n$ large enough we
have

$$\mathrm{VAR}[L_n(S)] \ge c\,n,$$

as claimed in the lemma. □
Lemma 2.2. It is true that

$$P(G_n^{II}) \to 2 P\left( \mathcal{N}(0,1) \ge \frac{c_1}{\sqrt{2 p_b}} \right) \quad \text{as } n \to \infty.$$

Proof. Let $c > 0$ be a constant. Let $J_n(c)$ be the interval

$$J_n(c) = \big[ E[N_{ab}] - c\sqrt{n},\; E[N_{ab}] + c\sqrt{n} \big].$$

Let $K_n(c)$ denote the event that $N_{ab}$ is in $J_n(c)$. Note that by the Law of Total Probability:

(2.17) $P(G_n^{II}) = P(G_n^{II} \mid K_n(c))\, P(K_n(c)) + P(G_n^{II} \mid K_n^c(c))\, P(K_n^c(c)).$

Now

(2.18) $P(G_n^{II} \mid K_n(c)) = \sum_{k \in J_n(c)} P(G_n^{II} \mid N_{ab} = k) \cdot P(N_{ab} = k \mid K_n(c)).$

But conditional on $N_{ab} = k$, the variables $N_b$ and $N_b^*$ become binomial with parameters
$p_b/(p_a + p_b)$ and $k$. Furthermore, $N_b$ and $N_b^*$ are independent of each other conditional on
$N_{ab} = k$. We can hence apply the Central Limit Theorem and find that, conditional on $N_{ab} = k$,
the variable $N_b - N_b^*$ is close to normal with expectation 0 and variance $2kq$, where
$q := p_b/(p_a + p_b)$. Hence, by the Central Limit Theorem, the probability of $G_n^{II}$ conditional on $N_{ab} = k$
is approximated by the following probability

$$P\big( |\mathcal{N}(0, 2kq)| \ge c_1 \sqrt{n} \big) = 2 P\left( \mathcal{N}(0,1) \ge \frac{c_1 \sqrt{n}}{\sqrt{2kq}} \right).$$

Let us denote by $\epsilon_k^n$ the approximation error, so that

$$\epsilon_k^n := P(G_n^{II} \mid N_{ab} = k) - 2 P\left( \mathcal{N}(0,1) \ge \frac{c_1 \sqrt{n}}{\sqrt{2kq}} \right).$$

When $k$ is in $J_n(c)$, the expression

$$\frac{c_1 \sqrt{n}}{\sqrt{2kq}}$$

ranges between

$$a_n^{+} := \frac{c_1}{\sqrt{2 p_b + 2cq/\sqrt{n}}}$$

and

$$a_n^{-} := \frac{c_1}{\sqrt{2 p_b - 2cq/\sqrt{n}}},$$

where the superscript indicates the sign appearing in the denominator.
From this and Equation (2.18) it follows that

(2.19) $\sum_{k \in J_n(c)} \epsilon_k^n \cdot P(N_{ab} = k \mid K_n(c)) + 2 P(\mathcal{N}(0,1) \ge a_n^{-}) \le P(G_n^{II} \mid K_n(c))$

(2.20) $\le \sum_{k \in J_n(c)} \epsilon_k^n \cdot P(N_{ab} = k \mid K_n(c)) + 2 P(\mathcal{N}(0,1) \ge a_n^{+}).$

Assume that $n$ is large enough (recall that $c > 0$ does not depend on $n$) so that the leftmost
point of $J_n(c)$ is above $n(p_a + p_b)/2$. (How large $n$ needs to be for this depends on $c$.) Then,
when $k \in J_n(c)$, we have for $n$ large enough that $k \ge n(p_a + p_b)/2$. Note that by the Berry-Esseen
inequality we have that

$$|\epsilon_k^n| \le \frac{C}{\sqrt{k}}$$

and hence, for all $k \in J_n(c)$ (provided $n$ is large enough), we find that

(2.21) $|\epsilon_k^n| \le \frac{C^*}{\sqrt{n}}$

where $C, C^* > 0$ are constants not depending on $n$. Using (2.21), we can rewrite the inequalities
given in (2.19) and (2.20), and obtain that for all $n$ large enough we have:

(2.22) $-\frac{C^*}{\sqrt{n}} + 2 P(\mathcal{N}(0,1) \ge a_n^{-}) \le P(G_n^{II} \mid K_n(c)) \le \frac{C^*}{\sqrt{n}} + 2 P(\mathcal{N}(0,1) \ge a_n^{+}).$

When $n \to \infty$, we have that $a_n^{-}$ and $a_n^{+}$ both converge to $c_1/\sqrt{2 p_b}$ and $C^*/\sqrt{n}$ goes to 0. Hence
we can apply the squeeze theorem for limits to the system of inequalities (2.22) and find that

(2.23) $P(G_n^{II} \mid K_n(c)) \to 2 P\left( \mathcal{N}(0,1) \ge \frac{c_1}{\sqrt{2 p_b}} \right)$

as $n \to \infty$. Note that by the Central Limit Theorem, the probability of $K_n(c)$ converges as $n \to \infty$.
Let $e(c)$ denote the limit

$$e(c) = \lim_{n \to \infty} P(K_n^c(c)).$$

Taking the lim sup and lim inf of Equation (2.17) and using (2.23), we get

(2.24) $2 P\left( \mathcal{N}(0,1) \ge \frac{c_1}{\sqrt{2 p_b}} \right) \cdot (1 - e(c)) \le \liminf_{n\to\infty} P(G_n^{II}) \le \limsup_{n\to\infty} P(G_n^{II})$

(2.25) $\le 2 P\left( \mathcal{N}(0,1) \ge \frac{c_1}{\sqrt{2 p_b}} \right) \cdot (1 - e(c)) + e(c).$

Note that the last two inequalities above hold for any $c > 0$ not depending on $n$. Furthermore,
$e(c) \to 0$ as $c \to \infty$. So, letting $c$ go to infinity, we finally find by the squeeze theorem applied to (2.24)
and (2.25) that

$$P(G_n^{II}) \to 2 P\left( \mathcal{N}(0,1) \ge \frac{c_1}{\sqrt{2 p_b}} \right)$$

as $n \to \infty$. □
Lemma 2.3. Assume that Inequality (2.10) holds for $\alpha > 0$ not depending on $n$. Then we have
that

$$P(H_n) \to 1$$

as $n \to \infty$.

Proof. Let $H_n^{I}(k,l)$ be the event that the conditional expected change in optimal alignment score
when we align $X(k,l)$ with $Y(k,l)$ is at least $\Delta$. Here we talk about the change induced by switching
a randomly chosen a into a b in the string $X(k,l)$ or the string $Y(k,l)$. If $(\tilde X(k,l), \tilde Y(k,l))$
denotes the randomly modified string pair $(X(k,l), Y(k,l))$, then by our definition of $f(\cdot,\cdot)$,
$f(k+1,l)$ is the optimal alignment score of $X(k+1,l)$ and $Y(k+1,l)$. Furthermore, $f(k,l)$
denotes the optimal alignment score of $X(k,l)$ with $Y(k,l)$. Now, formally, the event $H_n^{I}(k,l)$
holds when

$$E[f(k+1,l) - f(k,l) \mid X(k,l), Y(k,l)] \ge \Delta$$

which is the same as

$$E[\tilde L_n(S) - L_n(S) \mid X = X(k,l), Y = Y(k,l)] \ge \Delta$$

or equivalently

(2.26) $E[\tilde L_n(S) - L_n(S) \mid X, Y, N_b = k, N_{ab} = l] \ge \Delta.$

To understand why the last two inequalities above are equivalent, recall that the distribution of
$(X(k,l), Y(k,l))$ is the same as the distribution of $(X,Y)$ conditional on $N_b = k$ and $N_{ab} = l$.
For the probability of Inequality (2.26) above, if we did not also condition on $N_b = k$
and $N_{ab} = l$, we would have the bound on the right side of (2.10) available. By how much can
a small probability increase by conditioning? Let us take any two events $A$ and $B$. We have

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \le \frac{P(A)}{P(B)}.$$

So, by conditioning on an event $B$, the probability of any event $A$ increases by at most a factor
$1/P(B)$. This leads to

(2.27) $P(H_n^{Ic}(k,l)) = P\big( E[\tilde L_n(S) - L_n(S) \mid X, Y, N_b = k, N_{ab} = l] < \Delta \big) \le \frac{P\big( E[\tilde L_n(S) - L_n(S) \mid X, Y] < \Delta \big)}{P(N_b = k, N_{ab} = l)}.$

Let now $H_n^{I}$ denote the event

$$H_n^{I} = \bigcap_{k \in I_n,\, l \in J_n} H_n^{I}(k,l)$$

so that

(2.28) $P(H_n^{Ic}) \le \sum_{k \in I_n,\, l \in J_n} P(H_n^{Ic}(k,l)).$

By the assumption of the present lemma, that is, Inequality (2.10), the probability that
the inequality

$$E[\tilde L_n(S) - L_n(S) \mid X, Y] < \Delta$$

holds is below $n^{-\alpha \ln n}$. Also, by the Local Central Limit Theorem, there exists a constant
$c > 0$ not depending on $n$, $k$ or $l$, so that for all $k \in I_n$ and $l \in J_n$ we have

$$P(N_b = k, N_{ab} = l) \ge \frac{c}{n}.$$

Applying this and condition (2.10) to inequality (2.27), we find that for $k \in I_n$ and $l \in J_n$ we have

$$P(H_n^{Ic}(k,l)) \le n^{-\alpha \ln n} \cdot n/c = n^{-\alpha \ln n + 1}/c.$$

We can now use the last inequality above with inequality (2.28) to find

(2.29) $P(H_n^{Ic}) \le 4 c_2\, n^{-\alpha \ln n + 2}/c,$

where we used the fact that the number of integer couples $(k,l)$ with $k \in I_n$ and $l \in J_n$ is $4 c_2 n$.
Let $M(k,l)$ denote the value

$$M(k,l) := \sum_{i=0}^{k-1} \Big( f(i+1,l) - f(i,l) - E[f(i+1,l) - f(i,l) \mid X(i,l), Y(i,l)] \Big) + f(0,l).$$

Clearly, when we hold $l$ fixed, $M(\cdot, l)$ is a martingale.

Let $H_n^{II}(x,y,l)$ denote the event that

$$|M(y,l) - M(x,l)| \le 0.5\,|x - y|\,\Delta.$$

By Hoeffding's inequality for martingales, $H_n^{II}(x,y,l)$ has high probability:

(2.30) $P(H_n^{IIc}(x,y,l)) \le 2 \exp\big( -0.5\, \Delta^2 |x - y| / |S|^2 \big).$

Here $|S|$ denotes the maximum change in value of the scoring function when we change one
letter,

$$|S| = \max_{c,d,e \in \mathcal{A}^*} |S(c,d) - S(c,e)|.$$

Note that when we change only one letter in a string, the optimal alignment score changes
by at most $|S|$. Since we change only one letter to obtain $f(k+1,l)$ from $f(k,l)$, we have
$|f(k+1,l) - f(k,l)| \le |S|$ always. This also implies that $|M(k+1,l) - M(k,l)| \le |S|$ always,
which is what we used to apply Hoeffding's inequality.
Now let $H_n^{II}$ denote the event that $H_n^{II}(x,y,l)$ holds for all $x < y$ with $|x - y| \ge c_1 \sqrt{n}$,
$x, y \in I_n$ and $l \in J_n$. Then

(2.31) $P(H_n^{IIc}) \le \sum_{x,y \in I_n,\, l \in J_n} P(H_n^{IIc}(x,y,l))$

where the sum on the right side is taken over $|x - y| \ge c_1 \sqrt{n}$. The
number of triplets $(x,y,l)$ in the sum on the right side of (2.31) is less than $8 c_2^2 n^{1.5}$. This bound
together with (2.30) and $|x - y| \ge c_1\sqrt{n}$ implies

(2.32) $P(H_n^{IIc}) \le 16\, c_2^2\, n^{1.5} \exp\big( -0.5\, c_1 \Delta^2 \sqrt{n} / |S|^2 \big).$

Note that

$$f(k,l) = M(k,l) + \sum_{i=0}^{k-1} E[f(i+1,l) - f(i,l) \mid X(i,l), Y(i,l)]$$

so that

(2.33) $f(y,l) - f(x,l) = M(y,l) - M(x,l) + \sum_{i=x}^{y-1} E[f(i+1,l) - f(i,l) \mid X(i,l), Y(i,l)].$

Assume now that $l \in J_n$. Then, when the event $H_n^{I}$ holds, the sum of conditional expectations
on the right side of Equation (2.33) is at least $|y - x|\Delta$. Furthermore, when the event $H_n^{II}$ holds
and $|y - x| \ge c_1 \sqrt{n}$, then

$$|M(y,l) - M(x,l)| \le 0.5\,\Delta\,|x - y|.$$

It follows from (2.33) that when both $H_n^{I}$ and $H_n^{II}$ hold, and $y - x \ge c_1 \sqrt{n}$, then

$$f(y,l) - f(x,l) \ge 0.5\,|x - y|\,\Delta.$$

This is the condition in the definition of the event $H_n$. Hence $H_n^{I}$ and $H_n^{II}$ together
imply $H_n$:

$$H_n^{I} \cap H_n^{II} \subset H_n$$

and hence

(2.34) $P(H_n^c) \le P(H_n^{Ic}) + P(H_n^{IIc}).$

From the bounds (2.32) and (2.29) it follows that $P(H_n^{Ic})$ and $P(H_n^{IIc})$ both go to 0 as $n \to \infty$.
So, because of Inequality (2.34), we find that $P(H_n^c)$ also goes to 0 as $n \to \infty$. This concludes
the proof. □
According to Theorem 2.1, $\lambda(S) - \lambda(S - \epsilon T) > 0$ implies a positive biased effect
of the random change on the optimal alignment score. But by Lemma 2.1, a positive biased effect
on the optimal alignment score implies the fluctuation order

(2.35)  $VAR[L_n] = \Theta(n).$

Hence, the inequality $\lambda(S) - \lambda(S - \epsilon T) > 0$ implies the fluctuation order given by equation (2.35).
This is the content of the next theorem:

Theorem 2.2. Let $S : A^* \times A^* \to \mathbb{R}$ be a scoring function on the finite alphabet $A$. Let
$T : A^* \times A^* \to \mathbb{R}$ be defined as

$T(a,c) = T(c,a) := S(b,c) - S(a,c)$

for any $c \in A^*$ with $c \ne a$, and $T(d,c) = 0$ whenever $d \ne a$. Furthermore, let $T(a,a) =
2(S(b,a) - S(a,a))$. Let $\epsilon > 0$. If

(2.36)  $\lambda(S) - \lambda(S - \epsilon T) > 0,$

then

(2.37)  $VAR[L_n(S)] = \Theta(n).$



Proof. When

(2.38)  $\lambda(S) - \lambda(S - \epsilon T) > 0,$

Theorem 2.1 shows that with high probability the random change has a biased effect on the
optimal alignment score. By Lemma 2.1, this biased effect then implies the order of the fluctuation
(2.37). Let us present further details about this argument: Theorem 2.1 implies that Inequality
(2.5) follows from (2.38). Let $\delta > 0$ be taken as follows,

A(5) - X(S - eT) 

so that Inequality (2.5) becomes

(2.39) P (E[L n {S) - L n (S)\X,Y] > X{S) - gT) ) > 1 - n~ an . 

Since $\lambda(S) - \lambda(S - \epsilon T)$ is strictly positive, Lemma 2.1 then implies the desired order of
fluctuation, that is,

$VAR[L_n(S)] = \Theta(n).$

We have thus shown that condition (2.36) implies (2.37). □
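The definition of $T$ in Theorem 2.2 can be illustrated by a minimal Python sketch. The function name `build_change_matrix`, the dictionary representation of $S$ and the gap symbol "-" are assumptions made for this illustration only. The sketch also assumes that $S$ is symmetric and reads the definition as setting $T$ to zero on every pair that does not contain the letter $a$.

```python
def build_change_matrix(S, symbols, a, b):
    """Return T for the one-letter change a -> b, following Theorem 2.2.

    S maps ordered pairs of symbols (letters or the gap symbol) to scores and
    is assumed symmetric. Pairs that do not contain `a` get the value 0.
    """
    T = {}
    for c in symbols:
        for d in symbols:
            if c == a and d == a:
                T[(c, d)] = 2 * (S[(b, a)] - S[(a, a)])
            elif c == a:
                T[(c, d)] = S[(b, d)] - S[(a, d)]
            elif d == a:
                T[(c, d)] = S[(b, c)] - S[(a, c)]
            else:
                T[(c, d)] = 0
    return T


# Hypothetical example: binary alphabet, match score 1, mismatch 0, gap score -1.
symbols = ["0", "1", "-"]
S = {(c, d): (1 if c == d else 0) for c in symbols for d in symbols}
for c in symbols:
    S[(c, "-")] = S[("-", c)] = -1
T = build_change_matrix(S, symbols, a="0", b="1")
```

In this toy example, turning a 0 that is aligned with a 1 into a 1 gains one point, while turning one of two aligned 0's into a 1 loses one point, which is why $T(0,0)$ carries the factor 2 (either of the two letters may be the one that is changed).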
In many situations the last theorem is a very practical tool for verifying the fluctuation order
(2.37). By Monte Carlo simulation we can estimate the values of $\lambda(S)$ and $\lambda(S - \epsilon T)$ and test
the positivity of the quantity $\lambda(S) - \lambda(S - \epsilon T)$ at a given confidence level $\beta$. If it is positive
at the chosen confidence level, it follows from Theorem 2.2 that we are also $\beta$-confident that
the fluctuation order (2.37) applies. In other words, we check whether Inequality (2.36) holds at a
certain confidence level, which in practice depends on the available computational power. In
this fashion we can verify for many scoring functions that $VAR[L_n(S)] = \Theta(n)$ up to a certain
confidence level.



3. Proof of Theorem 2.1

In order to prove Theorem 2.1 we need to show that as soon as

$\lambda(S) - \lambda(S - \epsilon T) > 0$

holds, we get with high probability a positive lower bound for the expected effect of the random
change of one letter on the optimal alignment score. That lower bound for

$E[\tilde L_n(S) - L_n(S) \mid X, Y]$

is as close as we want to (but possibly slightly below) the following expression:

$\frac{\lambda(S) - \lambda(S - \epsilon T)}{\epsilon\, p_a}.$

To prove this, we introduce three events $A^n(S)$, $B^n(S)$ and $C^n(\delta)$. We then show in Lemma 3.1
that the three events $A^n(S)$, $B^n(S)$ and $C^n(\delta)$ jointly imply the desired lower bound on the
expected change in optimal alignment score. We then go on to prove that the events $A^n(S)$,
$B^n(S)$ and $C^n(\delta)$ all have high probability. This then implies that our lower bound for the
expected change in optimal alignment score must also hold with high probability.

So far we have traced out a way to prove Theorem 2.1. Let us now look at the details. Let $A^n(S)$
be the event that

$\frac{L_n(S)}{n} \ge \lambda(S) - \frac{\ln(n)}{\sqrt{n}}.$

Let $B^n(S)$ be the event that

$\frac{L_n(S - \epsilon T)}{n} \le \lambda(S - \epsilon T) + \frac{\ln(n)}{\sqrt{n}}.$
For any number $\delta > 0$, let $C^n(\delta)$ be the event that

$\frac{N_a^n}{n} \le p_a + \frac{\delta \ln n}{\sqrt{n}},$

where as before $p_a$ is the probability

$p_a := P(X_i = a) = P(Y_i = a).$

The main combinatorial idea in this paper is given below. It shows that the events $A^n(S)$,
$B^n(S)$ and $C^n(\delta)$ together imply the desired lower bound on the expected change of the optimal
alignment score when we change an $a$ into a $b$:

Lemma 3.1. Let $\epsilon > 0$ be a constant, and assume that

$\lambda(S) - \lambda(S - \epsilon T) > 0.$

Let $\delta, \delta_1 > 0$ be any two small constants not depending on $n$. When $A^n$, $B^n$ and $C^n(\delta_1)$ all hold
simultaneously, then for all $n$ large enough we have

$E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge \frac{\lambda(S) - \lambda(S - \epsilon T)}{\epsilon\, p_a} - \delta.$

(How large $n$ needs to be for the above inequality to hold depends on $\epsilon$, $\delta$, $\delta_1$ and $p_a$.)

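Proof. Assume that $A^n(S)$ holds. Then any optimal alignment $\pi$ of $X = X_1 \ldots X_n$ and $Y =
Y_1 \ldots Y_n$ satisfies

(3.1)  $\frac{S_\pi(X,Y)}{n} \ge \lambda(S) - \frac{\ln(n)}{\sqrt{n}}.$

When $B^n$ holds, then, since the score of $\pi$ under $S - \epsilon T$ is at most $L_n(S - \epsilon T)$,

(3.2)  $\frac{(S - \epsilon T)_\pi(X,Y)}{n} \le \lambda(S - \epsilon T) + \frac{\ln(n)}{\sqrt{n}}.$

By linearity, however,

$(S - \epsilon T)_\pi = S_\pi - \epsilon T_\pi.$

The last equation together with inequality (3.2) leads to

(3.3)  $\frac{S_\pi(X,Y)}{n} - \epsilon\,\frac{T_\pi(X,Y)}{n} \le \lambda(S - \epsilon T) + \frac{\ln(n)}{\sqrt{n}}.$

Subtracting Equation (3.1) from (3.3), we find

(3.4)  $\frac{\lambda(S) - \lambda(S - \epsilon T)}{\epsilon} - \frac{2\ln(n)}{\epsilon\sqrt{n}} \le \frac{T_\pi(X,Y)}{n}.$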
Now from Equality (2.4), we know that when changing a randomly chosen $a$ into a $b$, the expected
effect on the alignment score of $\pi$ is $T_\pi(X,Y)/N_a^n$. (Here $N_a^n$ denotes the total number of $a$'s in
the strings $X_1 X_2 \ldots X_n$ and $Y_1 \ldots Y_n$ combined.) Since $\pi$ is an optimal alignment according to
the scoring function $S$, the expected increase of the alignment score of $\pi$ is a lower bound for
the expected increase of the optimal alignment score. Hence, the expected increase in optimal
alignment score is at least $T_\pi(X,Y)/N_a^n$. (We do not necessarily have equality for the change in optimal
alignment score, but only a lower bound, because another alignment could become optimal
after we change a letter.) So, since $T_\pi(X,Y)/N_a^n$ is a lower bound for the
expected increase in optimal alignment score, multiplying Inequality (3.4) by $n/N_a^n$, we obtain
the following lower bound on the expected alignment score change:

(3.5)  $E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge \frac{n}{N_a^n}\left(\frac{\lambda(S) - \lambda(S - \epsilon T)}{\epsilon} - \frac{2\ln(n)}{\epsilon\sqrt{n}}\right).$

When the event $C^n(\delta_1)$ holds, we find that

$\frac{n}{N_a^n} \ge \frac{1}{p_a} \cdot \frac{1}{1 + \frac{\delta_1 \ln(n)}{p_a\sqrt{n}}},$

which we apply to Inequality (3.5) to obtain

$E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge \left(\frac{\lambda(S) - \lambda(S - \epsilon T)}{p_a\,\epsilon} - \frac{2\ln(n)}{\epsilon\, p_a\sqrt{n}}\right) \cdot \frac{1}{1 + \frac{\delta_1 \ln(n)}{p_a\sqrt{n}}}.$

From the last inequality above it follows by continuity that for all $n$ large enough

$E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge \frac{\lambda(S) - \lambda(S - \epsilon T)}{\epsilon\, p_a} - \delta,$

as soon as $\delta > 0$ does not depend on $n$. We used the fact that $\epsilon > 0$, $\delta_1$, $\delta$ and $p_a$ do not depend
on $n$. (So how large $n$ needs to be depends on $\epsilon$, $\delta$, $\delta_1$ and $p_a$.) □



In the next lemma we prove that the event $A^n(S)$ has probability close to 1 when $n$ is taken
large:

Lemma 3.2. For all $n$ large enough, we have that

$P(A^n(S)) \ge 1 - n^{-\alpha_1 \ln(n)},$

where $\alpha_1 = 1/(8|S|^2)$ and $|S| := \max_{c,d,e \in A^*} |S(c,d) - S(c,e)|$.
Proof. Note that by Lemma 6.2 there exists a constant $c > 0$ not depending on $n$, such that for
all $n$ large enough the following inequality holds:

$\lambda(S) - \lambda_n(S) \le c\,\frac{\sqrt{\ln(n)}}{\sqrt{n}}.$

Hence,

(3.6)  $\lambda(S) - \lambda_n(S) - \frac{\ln(n)}{\sqrt{n}} \le c\,\frac{\sqrt{\ln(n)}}{\sqrt{n}} - \frac{\ln(n)}{\sqrt{n}} \le -\frac{0.5\ln(n)}{\sqrt{n}},$

where the last inequality above holds for $n$ large enough. Now the event $A^n(S)$ holds exactly
when the following inequality is true:

(3.7)  $\frac{L_n(S)}{n} \ge \lambda_n(S) + (\lambda(S) - \lambda_n(S)) - \frac{\ln(n)}{\sqrt{n}}.$

The rightmost expression in inequality (3.6) is an upper bound for the expression

$\lambda(S) - \lambda_n(S) - \frac{\ln(n)}{\sqrt{n}}.$

In an inequality giving a (non-random) lower bound for a random variable, replacing
the lower bound by something bigger decreases the probability of the inequality. Hence the
probability of Inequality (3.7) is at least the probability of

(3.8)  $\frac{L_n(S)}{n} \ge \lambda_n(S) - \frac{0.5\ln(n)}{\sqrt{n}}.$

Since Inequality (3.7) is equivalent to the event $A^n(S)$, this means that

(3.9)  $P(A^n(S)) \ge P\left(\frac{L_n(S)}{n} \ge \lambda_n(S) - \frac{0.5\ln(n)}{\sqrt{n}}\right).$

We can now apply McDiarmid's Inequality (see Lemma 6.3) to the probability on the right-hand
side of the last inequality to find

(3.10)  $P\left(\frac{L_n(S)}{n} \ge \lambda_n(S) - \frac{0.5\ln(n)}{\sqrt{n}}\right) = P\bigl(L_n(S) - E[L_n(S)] \ge -(2n)\Delta\bigr) \ge$

(3.11)  $\ge 1 - \exp(-(2n)\Delta^2/|S|^2),$

where $\Delta = 0.25\ln(n)/\sqrt{n}$. We remark that McDiarmid's Inequality is applicable because $L_n(S)$
depends on $2n$ i.i.d. entries with the property that changing only one entry affects $L_n(S)$ by at
most $|S|$.

With our definition of $\Delta$, the exponential term on the right of Inequality (3.11) is
equal to

(3.12)  $\exp(-(2n)\Delta^2/|S|^2) = \exp(-(\ln(n))^2/(8|S|^2)) = n^{-\alpha_1\ln(n)},$

where $\alpha_1 = 1/(8|S|^2)$. The three relations (3.12), (3.11) and (3.9) jointly imply

$P(A^n(S)) \ge 1 - n^{-\alpha_1\ln(n)},$

where $\alpha_1 > 0$ is defined by

$\alpha_1 := \frac{1}{8|S|^2}.$

□



The next lemma shows the high probability of the event $B^n(S)$.

Lemma 3.3. For all $n$ large enough, the following bound holds:

$P(B^n(S)) \ge 1 - n^{-\alpha_2\ln(n)},$

where $\alpha_2 := 1/a^2$ and $a := \max_{c,d,e \in A^*} |S(c,d) - S(c,e) + \epsilon T(c,d) - \epsilon T(c,e)|$.

Proof. A simple subadditivity argument shows that

(3.13)  $\lambda_n(S - \epsilon T) \le \lambda(S - \epsilon T).$

If, in the definition of the event $B^n(S)$, we replace the upper bound by something smaller, we get
a lower probability. Hence, because of inequality (3.13), we obtain that

(3.14)  $P(B^n(S)) \ge P\left(\frac{L_n(S - \epsilon T)}{n} \le \lambda_n(S - \epsilon T) + \frac{\ln(n)}{\sqrt{n}}\right).$

The right side of (3.14) is equal to

(3.15)  $P\bigl(L_n(S - \epsilon T) - E[L_n(S - \epsilon T)] \le (2n)\Delta\bigr),$

where

$\Delta = \frac{\ln(n)}{2\sqrt{n}}.$

We can apply McDiarmid's Inequality (see Lemma 6.3) to the probability given in (3.15). We
find that (3.15) is greater than or equal to

(3.16)  $1 - \exp(-2(2n)\Delta^2/a^2) = 1 - \exp(-(\ln(n))^2/a^2) = 1 - n^{-\alpha_2\ln(n)},$

where $\alpha_2$ is equal to $1/a^2$. The constant $a$ is defined in the statement of the lemma.

Combining (3.16), (3.15) and (3.14), we finally obtain the required inequality

$P(B^n(S)) \ge 1 - n^{-\alpha_2\ln(n)}.$

□
The next lemma shows that the event $C^n(\delta)$ holds with high probability.

Lemma 3.4. Let $\delta > 0$ be a constant. We have that

$P(C^n(\delta)) \ge 1 - n^{-2\ln n}.$

Proof. The event $C^n(\delta)$ is equivalent to the following inequality:

$N_a^n - E[N_a^n] \le \Delta \cdot n,$

where

$\Delta := \frac{\ln n}{\sqrt{n}}.$

By McDiarmid's Inequality, we thus have

$P(C^n(\delta)) \ge 1 - \exp(-2\Delta^2 \cdot n) = 1 - n^{-2\ln n},$

as claimed. □

Let $\delta > 0$ not depend on $n$. Lemma 3.1 shows that when the events $A^n(S)$, $B^n(S)$ and $C^n(\delta)$
jointly hold, then for $n$ large enough we have

(3.17)  $E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge \frac{\lambda(S) - \lambda(S - \epsilon T)}{\epsilon\, p_a} - \delta.$

Hence, Inequality (3.17) holds with high probability, because the events $A^n(S)$, $B^n(S)$ and $C^n(\delta)$
all hold with high probability. More precisely, we get

(3.18)  $P\left(E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge \frac{\lambda(S) - \lambda(S - \epsilon T)}{\epsilon\, p_a} - \delta\right) \ge$

(3.19)  $\ge 1 - \bigl(P(A^{nc}(S)) + P(B^{nc}(S)) + P(C^{nc}(\delta))\bigr).$

But, by the last three lemmas above, the sum of probabilities

$P(A^{nc}(S)) + P(B^{nc}(S)) + P(C^{nc}(\delta))$

is bounded from above by

$n^{-\alpha_1\ln(n)} + n^{-\alpha_2\ln(n)} + n^{-2\ln(n)},$

which for $n$ large enough is bounded from above by

$n^{-\alpha\ln(n)},$

where $\alpha > 0$ is any constant not depending on $n$ and strictly smaller than $\alpha_1$, $\alpha_2$ and $2$. So,
from Inequality (3.18), we obtain that for all $n$ large enough

$P\left(E[\tilde L_n(S) - L_n(S) \mid X, Y] \ge \frac{\lambda(S) - \lambda(S - \epsilon T)}{\epsilon\, p_a} - \delta\right) \ge 1 - n^{-\alpha\ln(n)},$

where $\alpha > 0$ does not depend on $n$. This completes the proof of Theorem 2.1.



4. The case with the 4 letter genetic alphabet 

Changing a C or G into A or T: We consider here the genetic alphabet $\{A, T, C, G\}$. In this
case A and T can mutate easily into each other, and the same holds for C and G. But going from one
of these two groups to the other is more difficult. This implies that when we want to change a
letter from the group $\{A, T\}$ into a letter from the group $\{C, G\}$, we get more heavily punished
by the score. Furthermore, in the human genome the letters A and T have higher frequency
than C and G. We still take $X = X_1 X_2 \ldots X_n$ and $Y = Y_1 Y_2 \ldots Y_n$ to be i.i.d. sequences. We
consider a model where the probabilities of A and T are equal to each other, so that

$P(X_i = A) = P(Y_i = A) = P(X_i = T) = P(Y_i = T),$

and the probabilities of G and C are equal to each other:

$P(X_i = C) = P(Y_i = C) = P(X_i = G) = P(Y_i = G).$

The random change we consider consists in choosing at random a C or a G and changing it into
an A or a T. For this we pick among all the C's and G's within $X$ and $Y$ one at random with
equal probability. Then we flip a fair coin to decide whether the randomly chosen letter becomes an A
or a T. Finally, we change the randomly picked letter into an A or a T depending on the coin. The
new strings obtained from this one-letter change are denoted by $\tilde X$ and $\tilde Y$. Hence, there is only
one letter changed when going from $XY$ to $\tilde X \tilde Y$. This letter is a C or a G which was turned
into an A or a T.

Again, we denote by $L_n(S)$ the optimal alignment score of $X$ and $Y$ according to $S$,

$L_n(S) := \max_\pi S_\pi(X, Y),$

where the maximum above is taken over all alignments with gaps $\pi$ of $X$ with $Y$. The conditional
expected change, as before, is the alignment score of a scoring function $T$, which has to be defined
slightly differently from the previous case. We take $T$ as follows: for $U$ equal to C or G
and $V \in \{A, C, G, T, \mathcal{G}\}$, where $\mathcal{G}$ denotes the gap symbol, we first define $T_X$,

$T_X(U, V) := 0.5\,(S(A, V) - S(U, V)) + 0.5\,(S(T, V) - S(U, V)).$

When $U$ is not equal to C or G, then let $T_X(U, V) := 0$.

Similarly, we define $T_Y$ by

$T_Y(V, U) := 0.5\,(S(V, A) - S(V, U)) + 0.5\,(S(V, T) - S(V, U)),$

when $U$ is equal to C or G and $V \in \{A, C, G, T, \mathcal{G}\}$. Otherwise, we take $T_Y := 0$. Finally, we
define $T$ as the sum of $T_X$ and $T_Y$:

$T = T_X + T_Y.$

With this definition of $T$, the conditional expected change in the alignment score under $S$ equals the
alignment score under $T$ up to a factor. This is the same principle as the one leading to Equation
(2.4). Hence, for any alignment $\pi$ of $X$ and $Y$, the following holds true:

(4.1)  $E[S_\pi(\tilde X, \tilde Y) - S_\pi(X, Y) \mid X, Y] = \frac{T_\pi(X, Y)}{N_{C,G}},$

where $N_{C,G}$ represents the total number of C's and G's present in both $X$ and $Y$. As usual,
$T_\pi(X, Y)$ represents the score of the alignment $\pi$ when using the scoring function $T$ instead of
$S$. Also, $\pi$ is supposed to align $X = X_1 X_2 \ldots X_n$ with $Y = Y_1 Y_2 \ldots Y_n$.

Note that as $n \to \infty$, we have

$\frac{n}{N_{C,G}} \to \frac{1}{2(p_C + p_G)} = \frac{1}{4 p_C}.$

Hence, in equation (2.5) of Theorem 2.1 we need to replace $p_a$ by $2(p_C + p_G)$, where $p_C := P(X_i =
C) = P(Y_i = C)$ and $p_G := P(X_i = G) = P(Y_i = G)$.

With these notations, Theorem 2.1 and Lemma 2.1 remain valid provided we replace $p_a$
by $2(p_C + p_G)$ in equation (2.5). In other words, in this case also we just have to verify that
$\lambda(S) - \lambda(S - \epsilon T) > 0$ to get the variance order

$VAR[L_n(S)] = \Theta(n).$

Theorem 2.1 is proved the same way as in the previous case, so we leave it to the reader.
The only change is that we start with Equation (4.1) rather than (2.4). Then one can follow
the same steps. For Lemma 2.1 the situation is easier than it is with the change $a \to b$ in an
alphabet with more than 2 letters. Actually, the proof is very similar to the one done in [12]. We
thus only outline the proof: when we look at the proof of Lemma 2.1, we have two variables: $N_{ab}$
and $N_b$. In that proof, we condition on $N_{ab}$ and let $N_b$ vary to prove the fluctuation order. For
the genetic alphabet case, we do not need two variables but only one. So, $N_{C,G}$ will denote the
total number of C's and G's counted in both strings $X$ and $Y$. (This variable $N_{C,G}$ corresponds
to $N_b$ in the other case.) There is no need for another variable (like $N_{ab}$). So, we will generate a
random sequence of string pairs:

$(X(0), Y(0)), (X(1), Y(1)), \ldots, (X(k), Y(k)), \ldots, (X(2n), Y(2n)).$

The sequences $X(0)$ and $Y(0)$ are i.i.d. sequences independent of each other which contain only
the letters C and G. Those letters are taken equiprobable. Then we choose any letter and change
it into an A or a T. To decide whether it is A or T we flip a fair coin. We proceed by induction
on $k$: once $(X(k), Y(k))$ is obtained, we choose any C or G in $X(k), Y(k)$ and change it to A or
T. Among all C's and G's in both strings we choose with equal probability. In other words, we
apply the random change denoted by the tilde. This means that our recursive relation is

$(X(k+1), Y(k+1)) = (\tilde X(k), \tilde Y(k)).$

Note that with this definition, the total number of A's and T's in $X(k)$ and $Y(k)$ combined is
exactly $k$. Given that constraint, all possibilities are equally likely for $(X(k), Y(k))$. That is
to say, the probability distribution of $(X(k), Y(k))$ is the same as that of $(X, Y)$ conditional on
$N_{A,T} = k$:

$\mathcal{L}(X(k), Y(k)) = \mathcal{L}(X, Y \mid N_{A,T} = k).$
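The construction of the string pairs $(X(k), Y(k))$ can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the text (start from uniformly random strings over C and G, then repeatedly apply the random change); the function names `one_letter_change` and `build_chain` and the use of a seeded `random.Random` generator are choices made for this example.

```python
import random


def one_letter_change(x, y, rng):
    """Turn a uniformly chosen C or G of x and y combined into A or T (fair coin)."""
    positions = [(s, i) for s, seq in enumerate((x, y))
                 for i, ch in enumerate(seq) if ch in "CG"]
    s, i = rng.choice(positions)
    target = x if s == 0 else y
    target[i] = rng.choice("AT")


def build_chain(n, rng):
    """Return the list of pairs (X(k), Y(k)) for k = 0, 1, ..., 2n."""
    x = [rng.choice("CG") for _ in range(n)]
    y = [rng.choice("CG") for _ in range(n)]
    chain = [("".join(x), "".join(y))]
    for _ in range(2 * n):
        one_letter_change(x, y, rng)
        chain.append(("".join(x), "".join(y)))
    return chain


rng = random.Random(0)
chain = build_chain(5, rng)
```

By construction, the $k$-th pair of the chain contains exactly $k$ letters from $\{A, T\}$, and the last pair contains no C or G at all.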
So, if we produce the string pairs $(X(k), Y(k))$ independently of $N_{A,T}$, then we obtain that

$(X(N_{A,T}), Y(N_{A,T}))$

has the same distribution as $(X, Y)$. So, among other things, the fluctuation of the optimal alignment
score must be equal as well:

(4.2)  $VAR[S(X(N_{A,T}), Y(N_{A,T}))] = VAR[S(X, Y)] = VAR[L_n(S)].$

(Here $S(X(N_{A,T}), Y(N_{A,T}))$ denotes the optimal alignment score of the strings $X(N_{A,T})$ and
$Y(N_{A,T})$. Similarly, $S(X, Y)$ denotes the optimal alignment score of $X$ and $Y$.) So, if we denote
$S(X(k), Y(k))$ by $f(k)$, equation (4.2) becomes

(4.3)  $VAR[f(N_{A,T})] = VAR[L_n(S)].$

Now, assume that the random change typically has a biased effect on the alignment score as given
in Equation (2.10) in Lemma 2.1. We have that $f(k+1)$ is obtained from $f(k) = S(X(k), Y(k))$
by applying the random change. So, if (2.10) holds, then the expected random change typically
should be above $\Delta > 0$. So typically,

$E[f(k+1) - f(k) \mid X(k), Y(k)] \ge \Delta,$

where $\Delta > 0$ does not depend on $k$. In other words, $f(\cdot)$ behaves "like a biased random walk"
and, on a certain scale, has a slope which with high probability is at least $\Delta$. But assume
that $g$ is a non-random function with slope at least $\Delta$. Then for any variable $N$, it is shown in
[5] that

$VAR[g(N)] \ge \Delta^2\, VAR[N].$
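The inequality $VAR[g(N)] \ge \Delta^2\, VAR[N]$ can be checked numerically for a binomial variable $N$. The sketch below is an illustration only: the parameters `m`, `p`, `delta` and the particular function `g` (whose increments are all at least `delta`) are hypothetical choices, and the variances are computed exactly from the binomial probability mass function rather than by simulation.

```python
from math import comb


def binomial_pmf(m, p):
    """Probability mass function of a Binomial(m, p) variable on 0..m."""
    return [comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(m + 1)]


def variance(values, pmf):
    """Variance of a discrete variable taking values[k] with probability pmf[k]."""
    mean = sum(v * w for v, w in zip(values, pmf))
    return sum(w * (v - mean) ** 2 for v, w in zip(values, pmf))


m, p, delta = 40, 0.8, 0.3
pmf = binomial_pmf(m, p)

# Hypothetical g with increments g(k+1) - g(k) = delta + 0.05 * (k % 3) >= delta.
g = [0.0]
for k in range(m):
    g.append(g[-1] + delta + 0.05 * (k % 3))

var_N = variance(list(range(m + 1)), pmf)
var_gN = variance(g, pmf)
```

Here `var_N` equals $m\,p(1-p)$ up to rounding, and `var_gN` is at least `delta**2 * var_N`, in line with the inequality quoted from [5].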

So, we can apply this to our case: take $g$ equal to $f$ and $N$ equal to $N_{AT}$. We get that when
Inequality (2.10) holds, then

(4.4)  $VAR[f(N_{AT})] \ge \Delta^2\, VAR[N_{AT}] = \Delta^2\, 4nc\, p_{AT}(1 - p_{AT}),$

where $c > 0$ is a constant not depending on $n$. Here, the constant $c$ had to be introduced
because $f$ is random and does not have a slope of at least $\Delta$ everywhere, but only with high
probability and on a certain scale. We also used the fact that $N_{AT}$ is a binomial variable with
parameters $2n$ and $p_{AT} := P(X_i \in \{A, T\})$. Combining now (4.4) with (4.3), we finally obtain the
desired result

$VAR[L_n(S)] \ge \Delta^2\, 4nc\, p_{AT}(1 - p_{AT}),$

and hence

$VAR[L_n(S)] = \Theta(n).$

5. Determining when $\lambda(S) - \lambda(S - \epsilon T) > 0$ using simulations

Recall that $X_1 X_2 \ldots X_n$ and $Y_1 Y_2 \ldots Y_n$ are two i.i.d. sequences independent of each other.
Also recall that

$L_n(R)$

designates the optimal alignment score of $X_1 \ldots X_n$ and $Y_1 \ldots Y_n$ according to the scoring
function $R$. Furthermore, we saw that $L_n(R)/n$ converges to a finite number as $n \to \infty$, which we
denote by $\lambda_R$, so that

$\lambda_R := \lim_{n \to \infty} \frac{L_n(R)}{n}.$

We know by Theorem 2.2 that when

(5.1)  $\lambda(S) - \lambda(S - \epsilon T) > 0,$

the variance of the optimal alignment score is linear in $n$, that is,

(5.2)  $VAR[L_n(S)] = \Theta(n).$

So, we can run a Monte Carlo simulation and estimate the quantity on the left-hand side of (5.1).
If the estimate is positive, this is an indication that the left side of (5.1) is positive too and that
(5.2) holds. We can even go one step further and actually test at a certain significance level whether
inequality (5.1) is satisfied. If it is at a significance level $\beta > 0$, we are then $\beta$-confident that
the order of the fluctuation is as given in (5.2). In this way, we are able to verify up
to a certain confidence level that the variance of the optimal alignment score is linear in
$n$. We manage to do so for several realistic scoring functions.

To estimate the expression on the left-hand side of (5.1), we simply use $(L_n(S) - L_n(S -
\epsilon T))/n$. (Note that as $n$ goes to infinity our estimate goes to $\lambda(S) - \lambda(S - \epsilon T)$.) To do this, we
draw two sequences of length $n$ at random:

$X = X_1 \ldots X_n$

and

$Y = Y_1 \ldots Y_n.$

We then take the optimal alignment score of $X$ and $Y$ according to $S$, which is $L_n(S)$. Next, we
calculate the optimal alignment score of $X$ and $Y$ according to $S - \epsilon T$, which yields $L_n(S - \epsilon T)$.
Finally, we subtract the two and divide by $n$ so as to get our estimate of the left side of Inequality
(5.1),

(5.3)  $\hat\lambda(S) - \hat\lambda(S - \epsilon T) := \frac{L_n(S) - L_n(S - \epsilon T)}{n}.$

When our estimate is positive, it seems likely that Inequality (5.1) is satisfied. We need
to ask ourselves, however, how big the estimate needs to be to guarantee that (5.1) holds up to
a high enough confidence level.
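The estimation procedure just described can be sketched in Python as follows. This is a minimal illustration, not the code used for the simulations reported in this paper. It assumes a gap penalty that is the same for all letters, so that $T$ vanishes on pairs containing a gap and $S$ and $S - \epsilon T$ share the same gap penalty. The function names, the dictionary representation of $S$ and $T$, and the small binary example are all hypothetical.

```python
import random


def optimal_alignment_score(x, y, score, gap_penalty):
    """Optimal global alignment score by dynamic programming, O(len(x) * len(y)).

    score(c, d) is the score of aligning letters c and d; aligning any letter
    with a gap scores -gap_penalty.
    """
    prev = [-gap_penalty * j for j in range(len(y) + 1)]
    for i, c in enumerate(x, start=1):
        cur = [-gap_penalty * i]
        for j, d in enumerate(y, start=1):
            cur.append(max(prev[j - 1] + score(c, d),
                           prev[j] - gap_penalty,
                           cur[j - 1] - gap_penalty))
        prev = cur
    return prev[-1]


def estimate_difference(x, y, S, T, eps, gap_penalty):
    """Return (L_n(S) - L_n(S - eps*T)) / n for two sequences of common length n."""
    L_S = optimal_alignment_score(x, y, lambda c, d: S[c, d], gap_penalty)
    L_ST = optimal_alignment_score(
        x, y, lambda c, d: S[c, d] - eps * T[c, d], gap_penalty)
    return (L_S - L_ST) / len(x)


# Hypothetical binary example: S is the identity score, T encodes the change 0 -> 1.
S = {("0", "0"): 1, ("0", "1"): 0, ("1", "0"): 0, ("1", "1"): 1}
T = {("0", "0"): -2, ("0", "1"): 1, ("1", "0"): 1, ("1", "1"): 0}
rng = random.Random(0)
n = 200
x = rng.choices("01", weights=[0.2, 0.8], k=n)
y = rng.choices("01", weights=[0.2, 0.8], k=n)
stat = estimate_difference(x, y, S, T, eps=0.5, gap_penalty=1)
```

The quadratic running time of the dynamic program is what limits the sequence lengths that can be simulated in practice.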

When our estimate is positive, we determine at which confidence level (5.1) holds. Assume
that the value reached by our estimate is $x$. (So, after one simulation, $x$ designates the numerical
value taken by (5.3).) For the confidence level, we need an upper bound on the probability that
the estimate reaches the value $x$ if in reality $\lambda_S - \lambda_{S - \epsilon T}$ was negative. The confidence level is
then one minus this probability.
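Let us go through the calculation. First we denote by $E^n$ the following expectation:

$E^n := \frac{E[L_n(S)] - E[L_n(S - \epsilon T)]}{n}.$

We have that

(5.4)  $P\left(\frac{L_n(S) - L_n(S - \epsilon T)}{n} \ge x\right) = P\left(\frac{L_n(S) - L_n(S - \epsilon T)}{n} - E^n \ge x - E^n\right)$

(5.5)  $\le P\left(\frac{L_n(S) - L_n(S - \epsilon T)}{n} - E^n \ge x - E^n + (\lambda(S) - \lambda(S - \epsilon T))\right),$

where the last inequality above was obtained because we make the assumption that $\lambda(S) - \lambda(S -
\epsilon T) \le 0$. Now,

(5.6)  $-E^n + (\lambda(S) - \lambda(S - \epsilon T)) = \left(\lambda(S) - \frac{E[L_n(S)]}{n}\right) - \left(\lambda(S - \epsilon T) - \frac{E[L_n(S - \epsilon T)]}{n}\right).$

By subadditivity we have that

(5.7)  $\lambda(S) - \frac{E[L_n(S)]}{n} \ge 0.$

In the appendix, Lemma 6.2 allows us to bound from above the quantity

$\lambda(S - \epsilon T) - \frac{E[L_n(S - \epsilon T)]}{n}$

by the bound

(5.8)  $c_n\,|S - \epsilon T| \cdot \frac{\sqrt{\ln(n)}}{\sqrt{n}},$

where

$c_n := \sqrt{\frac{2\ln 3 + 2\ln(n+2)}{\ln(n)}}.$

(Note that we leave out the additional term of order $1/n$ which appears in inequality (6.3). This term is of an order
too small to be practically relevant.) Using now the upper bound (5.8) and inequality (5.7) with
(5.6) in (5.4) and (5.5), we finally find

(5.9)  $P\left(\frac{L_n(S) - L_n(S - \epsilon T)}{n} \ge x\right) \le P\left(\frac{L_n(S) - L_n(S - \epsilon T)}{n} - E^n \ge x - c_n\,|S - \epsilon T| \cdot \frac{\sqrt{\ln(n)}}{\sqrt{n}}\right).$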
n 
We can now use the Azuma-Hoeffding Inequality (see Lemma 6.3 in the Appendix) to bound the
probability on the right side of inequality (5.9). As a matter of fact, when we change one of the $2n$
i.i.d. entries (which are $X_1 \ldots X_n$ and $Y_1 \ldots Y_n$), the term

$L_n(S) - L_n(S - \epsilon T)$

changes by at most a quantity

$|S| + |S - \epsilon T|,$

where, as before, $|R|$ denotes the maximum change in aligned letter pair score when one changes
one letter under the scoring function $R$,

$|R| := \max_{c,d,e \in A^*} |R(c,d) - R(c,e)|.$

So, applying Lemma 6.3 to the right side expression of (5.9), we find

(5.10)  $P\left(\frac{L_n(S) - L_n(S - \epsilon T)}{n} \ge x\right) \le \exp\left(-n\Delta^2/(|S| + |S - \epsilon T|)^2\right),$

where

$\Delta = x - c_n\,|S - \epsilon T| \cdot \frac{\sqrt{\ln(n)}}{\sqrt{n}}.$

One minus the bound on the right side of (5.10) is how confident we are that $\lambda(S) - \lambda(S - \epsilon T)$ is
not negative. Of course, for this to make sense, we need to first check that the value of the
estimate $x$ is above $c_n\,|S - \epsilon T| \cdot \sqrt{\ln(n)}/\sqrt{n}$.
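The confidence computation based on (5.8) and (5.10) can be sketched as a small Python function. The function name `confidence_lower_bound` and its argument names are hypothetical; `norm_S` and `norm_S_eps_T` stand for $|S|$ and $|S - \epsilon T|$, which the caller is assumed to have computed from the scoring functions. The sketch omits the small $1/n$ term, as the text does.

```python
import math


def confidence_lower_bound(x, n, norm_S, norm_S_eps_T):
    """Return one minus the bound (5.10), or None when the test is inconclusive.

    x is the observed value of (L_n(S) - L_n(S - eps*T)) / n.
    """
    c_n = math.sqrt((2 * math.log(3) + 2 * math.log(n + 2)) / math.log(n))
    delta = x - c_n * norm_S_eps_T * math.sqrt(math.log(n)) / math.sqrt(n)
    if delta <= 0:
        # The estimate does not exceed the bias correction, so (5.10) says nothing.
        return None
    p_bound = math.exp(-n * delta**2 / (norm_S + norm_S_eps_T) ** 2)
    return 1.0 - p_bound
```

For a fixed $n$, the returned confidence increases with the observed value `x`, and the function returns `None` whenever `x` is not above the threshold $c_n |S - \epsilon T| \sqrt{\ln(n)}/\sqrt{n}$.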

In what follows, $S$ refers to the substitution matrix

$(S(i,j))_{i,j \in A},$

which is obtained from the scoring function $S$. (Basically, the matrix $S$ is just a way of writing
the scoring function $S : A \times A \to \mathbb{R}$ in matrix form.) Also, in all the examples we investigated
we took the gap penalty to be the same for all letters: this means that aligning any letter with
a gap has the same score, not depending on which letter gets aligned with the gap. We denote
by $\delta$ the gap penalty, that is,

$\delta := -S(c, \mathcal{G}),$

where the expression on the right side of the above equality, in the situations examined numerically
in this paper, does not depend on which letter $c \in A$ we consider. (Recall that $\mathcal{G}$ denotes the
symbol used for a gap.)

Let us quickly explain the situations for which we verified through Monte Carlo simulation that
with a high confidence level $\lambda(S) - \lambda(S - \epsilon T) > 0$ for an $\epsilon > 0$:
(1) The first situation is the same as before, except that we change a 0 into a 1 in the
sequence $X$ and then another 0 into a 1 in $Y$. So the random change consists of two
letters changed. This then yields the matrix $T$ to be



T 2 :-- 



everything else remains the same. 
(2) Another situation is the DNA alphabet $\{A, T, C, G\}$. In this case A and T can mutate
easily into each other, and the same holds for C and G. But going from one of these two groups
to the other is more difficult. This implies that when we want to change a letter from
the group $\{A, T\}$ into a letter from the group $\{C, G\}$, we get more heavily punished by
the score. This can be seen in the default substitution matrix used by BLASTZ:

                             A       T       C       G
                       A    91     -31    -114    -123
    S_blastz = S_BL =  T   -31     100    -125    -114
                       C  -114    -125     100     -31
                       G  -123    -114     -31      91

In the human genome the letters A and T have higher frequency than G and C. We took
A and T to each have frequency 0.4 and G and C to each have frequency 0.1.
With these choices and a gap penalty of 800 we obtained the desired result. The random
change for this is defined as follows:

we pick one C or G in any of the two sequences $X$ and $Y$. That is, we consider all C's and
all G's appearing in both $X$ and $Y$ and with equal probability just choose one such letter.
Then we flip a fair coin to decide whether we change that symbol into an A or a T, and then
do the change accordingly. The new strings are denoted by $\tilde X$ and $\tilde Y$, respectively. The difference
between $XY$ and $\tilde X \tilde Y$ is exactly one C or G which got turned into an A or a T.
The random-change matrix $T$ in that case is equal to:

                             A       T       C       G
                       A     0       0     144     153
    T_blastz = T_BL =  T     0       0   159.5   148.5
                       C   144   159.5    -439    -176
                       G   153   148.5    -176    -419



Note that the random change described here tends to increase the score, since C and G
are likely to be aligned with A or T, as there are more A's and T's. The BLASTZ
default gap penalty is 400, but for significantly determining that (5.1) holds, we need a
higher gap penalty $\delta$ of 1200.
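As a consistency check, the letter entries of the random-change matrix can be recomputed from the BLASTZ substitution matrix with the definitions of $T_X$ and $T_Y$ from Section 4. The Python sketch below is an illustration only: the variable and function names are hypothetical, the second term of $T_Y$ is read as $S(V,T) - S(V,U)$, and entries involving the gap symbol are left out.

```python
LETTERS = "ATCG"
ROWS = [
    [91, -31, -114, -123],
    [-31, 100, -125, -114],
    [-114, -125, 100, -31],
    [-123, -114, -31, 91],
]
# Substitution scores quoted in the text, rows and columns ordered A, T, C, G.
S = {(c, d): ROWS[i][j] for i, c in enumerate(LETTERS) for j, d in enumerate(LETTERS)}


def T_X(u, v):
    """Expected score change when the first letter u (a C or G) becomes A or T."""
    if u not in "CG":
        return 0.0
    return 0.5 * (S[("A", v)] - S[(u, v)]) + 0.5 * (S[("T", v)] - S[(u, v)])


def T_Y(v, u):
    """Expected score change when the second letter u (a C or G) becomes A or T."""
    if u not in "CG":
        return 0.0
    return 0.5 * (S[(v, "A")] - S[(v, u)]) + 0.5 * (S[(v, "T")] - S[(v, u)])


T = {(c, d): T_X(c, d) + T_Y(c, d) for c in LETTERS for d in LETTERS}
```

With these inputs the sketch reproduces the non-zero entries listed for the random-change matrix, for instance `T[("A", "C")] == 144` and `T[("C", "C")] == -439`, and gives zero on the block of pairs made of A and T only.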

Let us summarize what we found in our simulations:

    Case         I                       II
    Alphabet     {0,1}                   {A, T, C, G}
    P(.)         p_0 = 0.2, p_1 = 0.8    p_A = 0.4, p_T = 0.4, p_C = 0.1, p_G = 0.1
    S            id_2                    S_BL
    T            T_2                     T_BL
    delta        6                       1200
    n            10^5                    2 x 10^5
    epsilon      0.5                     0.9
    \hat L_n     0.0634                  15.197
    p-value      0.0102                  2.4 x 10^{-4}

In the table above, $\hat L_n$ designates our test statistic,

$\hat L_n := \frac{L_n(S) - L_n(S - \epsilon T)}{n},$

and $\delta$ denotes the gap penalty. Now, the algorithm to find the optimal alignment score of
two sequences of length $n$ is of order constant times $n^2$. So, our simulation to obtain $\hat L_n$ with
$n = 100000$ ran overnight, but if one has more time, one could run longer sequences and get
even better results. For example, we use the actual default matrix for BLASTZ, but then our
gap penalty is 1200 whilst the default is only 400. In reality, when doing the simulations with,
say, a gap penalty of 600, one always gets $\hat L_n$ to be positive, but not positive enough to beat
our theoretical bound for the difference between $E[L_n]/n$ and the limit $\lambda(S) - \lambda(S - \epsilon T)$. Now,
there are known methods [7], [15], [8], [9] to find confidence bounds for $\lambda(S)$ which are much better
than what we use here. (In this paper we simply simulate two long sequences $X = X_1 \ldots X_n$
and $Y = Y_1 \ldots Y_n$ and then compute the optimal alignment scores for $S$ and $S - \epsilon T$. The
difference of the scores then leads to $\hat L_n$.) So, in our opinion, using some of these advanced methods
or running very long simulations will allow for proving the order

(5.11)  $VAR[L_n(S)] = \Theta(n)$

for even "less extreme" situations. For example, we expect that if the gap penalty is 600 instead of
1200 we should still manage to show (5.11), and likewise when the probabilities are even less biased, say
0.2, 0.2, 0.3, 0.3 instead of 0.1, 0.1, 0.4, 0.4. Nonetheless, what we achieve in this article is already
quite remarkable, considering that in the article [], for binary sequences, the probability
of 1 has to be below $10^{-12}$ for the technique to work. Compare this with the probabilities in this
paper of $P(X_i = 1) = 0.2$, $P(X_i = 0) = 0.8$, for which we are able to show that (5.11) holds up to
a high confidence level.



6. Appendix: Large Deviations 

We denote by $L_S(x_1 \ldots x_i, y_1 \ldots y_j)$ the optimal alignment score of the strings $x_1 \ldots x_i$ with
$y_1 \ldots y_j$ according to the scoring function $S$. Also, recall the definition given in the first section:
$L_n(S) := L_S(X_1 \ldots X_n, Y_1 \ldots Y_n)$ and $\lambda_n(S) := E[L_n(S)]/n$. Furthermore, recall that $\lambda_n(S) \to
\lambda(S)$. In this appendix we will show a stronger result that quantifies the convergence rate as
being of order $O(\sqrt{\ln n / n})$. For this purpose, we introduce the following notation,

$\|S\|_\delta = \max_{c,d,e \in A^*} |S(c,d) - S(c,e)|,$

$\|S\|_\infty = \max_{c,d \in A^*} |S(c,d)|.$

Lemma 6.1. Let $x = x_1 \ldots x_m$ and $y = y_1 \ldots y_n$ be two given strings with letters from the
alphabet $A$, and let $S$ be a given scoring function. Let further $\hat x \in A$, and consider two
amendments of the string $x$: $x^{[i]} = x_1 \ldots x_{i-1}\, \hat x\, x_{i+1} \ldots x_m$, obtained by replacing an arbitrary letter $x_i$ by
$\hat x$, and $x^{[+]} = x_1 \ldots x_m \hat x$, obtained by extending $x$ by a letter $\hat x$. Then the following hold true:

(6.1)  $|L_S(x^{[i]}, y) - L_S(x, y)| \le \|S\|_\delta,$

(6.2)  $|L_S(x^{[+]}, y) - L_S(x, y)| \le \|S\|_\infty.$

Proof. Let $\pi$ be an optimal alignment of $x$ and $y$, so that $S_\pi(x, y) = L_S(x, y)$, and denote the
letter with which $x_i$ is aligned under $\pi$ by $a \in A^*$. Then

$L_S(x^{[i]}, y) \ge S_\pi(x^{[i]}, y) = S_\pi(x, y) - S(x_i, a) + S(\hat x, a) \ge L_S(x, y) - \|S\|_\delta.$

Applying the identical argument to an optimal alignment of $x^{[i]}$ and $y$, we obtain the analogous
inequality

$L_S(x, y) \ge L_S(x^{[i]}, y) - \|S\|_\delta,$

so that (6.1) follows.

For the second claim, let us use an optimal alignment $\pi$ of $x$ and $y$ to construct an alignment
$\pi^{[+]}$ of $x^{[+]}$ and $y$ by appending an aligned pair of letters $(\hat x, \mathcal{G})$, where $\mathcal{G}$ denotes a gap. Then
we have

$L_S(x^{[+]}, y) \ge S_{\pi^{[+]}}(x^{[+]}, y) = S_\pi(x, y) + S(\hat x, \mathcal{G}) \ge L_S(x, y) - \|S\|_\infty.$

Conversely, we can amend an optimal alignment $\pi^{[+]}$ of $x^{[+]}$ and $y$ to become a valid alignment
$\pi$ of $x$ and $y$ by cropping the last pair of aligned letters, $(\hat x, a)$. We then have

$L_S(x, y) \ge S_\pi(x, y) = S_{\pi^{[+]}}(x^{[+]}, y) - S(\hat x, a) \ge L_S(x^{[+]}, y) - \|S\|_\infty,$

thus establishing (6.2). □
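Inequality (6.1) can be checked empirically on small random strings. The Python sketch below is an illustration under stated assumptions: the scoring function (match 2, mismatch -1, gap -3) is a hypothetical symmetric example, the gap symbol is written "-", and the function name `optimal_score` is ours. It replaces one letter of $x$ at random and records the largest observed change of the optimal alignment score.

```python
import random

GAP = "-"


def optimal_score(x, y, S):
    """Optimal global alignment score; S[(c, d)] covers letters and the gap symbol."""
    prev = [0.0]
    for d in y:
        prev.append(prev[-1] + S[(GAP, d)])
    for c in x:
        cur = [prev[0] + S[(c, GAP)]]
        for j, d in enumerate(y, start=1):
            cur.append(max(prev[j - 1] + S[(c, d)],
                           prev[j] + S[(c, GAP)],
                           cur[j - 1] + S[(GAP, d)]))
        prev = cur
    return prev[-1]


rng = random.Random(1)
alphabet = "ab"
symbols = alphabet + GAP
# Hypothetical symmetric scoring function: match 2, mismatch -1, gap -3.
S = {(c, d): (2.0 if c == d else -1.0) for c in symbols for d in symbols}
for c in symbols:
    S[(c, GAP)] = S[(GAP, c)] = -3.0
norm_delta = max(abs(S[(c, d)] - S[(c, e)])
                 for c in symbols for d in symbols for e in symbols)

worst = 0.0
for _ in range(200):
    x = [rng.choice(alphabet) for _ in range(8)]
    y = [rng.choice(alphabet) for _ in range(8)]
    base = optimal_score(x, y, S)
    x[rng.randrange(len(x))] = rng.choice(alphabet)  # replace one letter of x
    worst = max(worst, abs(optimal_score(x, y, S) - base))
```

In every trial the observed change `worst` stays below `norm_delta`, which plays the role of $\|S\|_\delta$ for this example.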

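Lemma 6.2. The convergence of $\lambda_n(S)$ to $\lambda(S)$ is governed by the inequality

(6.3)  $\lambda_n(S) \le \lambda(S) \le \lambda_n(S) + c_n\,\|S\|_\delta \sqrt{\frac{\ln n}{n}} + \frac{2\|S\|_\infty}{n}, \qquad \forall n \in \mathbb{N},$

where

$c_n := \sqrt{\frac{2\ln 3 + 2\ln(n+2)}{\ln n}}.$

Note that $c_n$ tends to $\sqrt{2}$ when $n \to \infty$, so that it effectively acts as a constant.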
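The behaviour of the constant $c_n$ can be illustrated numerically with a short Python sketch (the function name `c_n` is chosen here for illustration). Since $c_n^2 = 2 + (2\ln 3 + 2\ln(1 + 2/n))/\ln n$, the convergence to $\sqrt{2}$ is only logarithmic in $n$.

```python
import math


def c_n(n):
    """Constant c_n = sqrt((2 ln 3 + 2 ln(n + 2)) / ln n) from Lemma 6.2."""
    return math.sqrt((2 * math.log(3) + 2 * math.log(n + 2)) / math.log(n))


# Values for a few sequence lengths; they decrease slowly towards sqrt(2).
values = {n: c_n(n) for n in (10**3, 10**5, 10**6)}
```

For sequence lengths of practical interest, such as $n = 10^5$, the value of $c_n$ is therefore still noticeably above $\sqrt{2}$, which matters when the bound (5.8) is evaluated numerically.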



Proof. Let k,n £ ~N, m = k x n, and let P m>n denote the set of all pairs (r, s) of 2k dimensional 
integer vectors f = [n ■■■ r 2 k] T G Np fc and s = [si ... s 2k )] T G Np fc that satisfy — rj_i + s« — 
s i-i G {n — l,n, n + 1} for i = 1,2, .. . , 2fc, as well as = ro < ri < • • • < r2fc = to and 
= s < si < ■ ■ ■ < s 2 k = m. 

For $(\vec r, \vec s) \in \mathcal{P}_{m,n}$, let $L_m(S, \vec r, \vec s)$ denote the sum of optimal alignment scores

$$L_m(S, \vec r, \vec s) := \sum_{i=1}^{2k} L_S\left(X_{r_{i-1}+1} \dots X_{r_i},\; Y_{s_{i-1}+1} \dots Y_{s_i}\right). \tag{6.4}$$

Thus, $L_m(S, \vec r, \vec s)$ is the optimal alignment score with the additional constraint that $X_{r_{i-1}+1} \dots X_{r_i}$ be aligned with $Y_{s_{i-1}+1} \dots Y_{s_i}$ for $i = 1, 2, \dots, 2k$.
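The constrained score (6.4) can be computed piece by piece and compared with the unconstrained optimum. The sketch below is illustrative only: the scoring values and helper names are hypothetical, and `random_pair` produces just one simple family of admissible cut points (pieces of total length exactly $n$), not a uniform sample from $\mathcal{P}_{m,n}$.

```python
import random

GAP = "-"          # hypothetical gap symbol
ALPHABET = "ACGT"


def score(a, b):
    """Hypothetical symmetric scoring function on letters and the gap symbol."""
    if a == GAP or b == GAP:
        return -2.0
    return 1.0 if a == b else -1.0


def optimal_score(x, y):
    """Optimal global alignment score L_S(x, y) by dynamic programming."""
    m, n = len(x), len(y)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + score(x[i - 1], GAP)
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + score(GAP, y[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = max(
                D[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                D[i - 1][j] + score(x[i - 1], GAP),
                D[i][j - 1] + score(GAP, y[j - 1]),
            )
    return D[m][n]


def random_pair(n, k, rng):
    """Cut points r_0..r_2k, s_0..s_2k with r_i - r_{i-1} + s_i - s_{i-1} = n."""
    r, s = [0], [0]
    for _ in range(k):
        a = rng.randint(0, n)
        # Two consecutive pieces (a, n - a) and (n - a, a) advance both strings by n.
        for dr, ds in ((a, n - a), (n - a, a)):
            r.append(r[-1] + dr)
            s.append(s[-1] + ds)
    return r, s


def constrained_score(x, y, r, s):
    """Sum of piecewise optimal scores, as in (6.4)."""
    return sum(
        optimal_score(x[r[i - 1]:r[i]], y[s[i - 1]:s[i]])
        for i in range(1, len(r))
    )
```

Since the piecewise alignments concatenate to a valid alignment of the full strings, `constrained_score` can never exceed `optimal_score` on the same pair of strings.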



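Note that for $L_m(S)/m$ to reach a value $x$, at least one of the $L_m(S, \vec r, \vec s)/m$ would have to reach $x$ as well. The following inequality holds therefore for all $x \in \mathbb{R}$,

$$P\left[\frac{L_m(S)}{m} \ge x\right] \le \sum_{(\vec r, \vec s) \in \mathcal{P}_{m,n}} P\left[\frac{L_m(S, \vec r, \vec s)}{m} \ge x\right]. \tag{6.5}$$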



Lemma 6.1 shows that a change in the value of any one of the $2m$ i.i.d. variables $X_1, \dots, X_m, Y_1, \dots, Y_m$ after sampling them, whilst leaving the values of the remaining variables unchanged, causes the value of $L_m(S, \vec r, \vec s)$ to change by at most $\|S\|_\delta$. Lemma 6.3, applied with $2m$ variables, $C = \|S\|_\delta$ and $\epsilon = \Delta/2$, thus implies that for any $\Delta > 0$ we have

$$P\left[L_m(S, \vec r, \vec s) - E\left[L_m(S, \vec r, \vec s)\right] \ge m\Delta\right] \le \exp\left\{-\frac{m\Delta^2}{\|S\|_\delta^2}\right\}. \tag{6.6}$$

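Furthermore, Lemma 6.4 will establish that

$$\frac{E\left[L_m(S, \vec r, \vec s)\right]}{m} \le \lambda_n(S) + \frac{2\|S\|_\infty}{n},$$

so that we have

$$P\left[\frac{L_m(S, \vec r, \vec s)}{m} \ge \lambda_n(S) + \frac{2\|S\|_\infty}{n} + \Delta\right] \le P\left[L_m(S, \vec r, \vec s) - E\left[L_m(S, \vec r, \vec s)\right] \ge m\Delta\right] \le \exp\left\{-\frac{m\Delta^2}{\|S\|_\delta^2}\right\}.$$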



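Substituting this last bound into (6.5) with $x = \lambda_n(S) + \frac{2\|S\|_\infty}{n} + \Delta$, we obtain

$$P\left[\frac{L_m(S)}{m} \ge \lambda_n(S) + \frac{2\|S\|_\infty}{n} + \Delta\right] \le [3(n+2)]^{2k} \exp\left\{-\frac{m\Delta^2}{\|S\|_\delta^2}\right\},$$

where we used the observation that $|\mathcal{P}_{m,n}| \le [3(n+2)]^{2k}$.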
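The counting bound can be verified directly for small $n$ and $k$. The following sketch counts the admissible pairs by dynamic programming over the end points $(r_i, s_i)$; the function name `count_pairs` is hypothetical, and the constraints encoded are those of the definition of $\mathcal{P}_{m,n}$ given at the start of the proof.

```python
def count_pairs(n, k):
    """Number of pairs (r, s) in P_{m,n} with m = k * n."""
    m = k * n
    counts = {(0, 0): 1}  # number of ways to reach each end point (r_i, s_i)
    for _ in range(2 * k):
        nxt = {}
        for (r, s), ways in counts.items():
            # Each piece consumes dr letters of X and ds letters of Y,
            # with dr + ds in {n - 1, n, n + 1}.
            for total in (n - 1, n, n + 1):
                for dr in range(total + 1):
                    r2, s2 = r + dr, s + (total - dr)
                    if r2 <= m and s2 <= m:
                        nxt[(r2, s2)] = nxt.get((r2, s2), 0) + ways
        counts = nxt
    return counts.get((m, m), 0)


if __name__ == "__main__":
    for n, k in [(1, 1), (2, 2), (3, 2), (4, 3)]:
        print(n, k, count_pairs(n, k), (3 * (n + 2)) ** (2 * k))
```

Each piece admits at most $n + (n+1) + (n+2) = 3n + 3$ choices of $(r_i - r_{i-1}, s_i - s_{i-1})$, which is why the count stays below $[3(n+2)]^{2k}$.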



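Next, fix a constant $c$ and let $\Delta = c/\sqrt{n}$. Substitution into the last estimate yields

$$P\left[\frac{L_m(S)}{m} \ge \lambda_n(S) + \frac{2\|S\|_\infty}{n} + \frac{c}{\sqrt{n}}\right] \le \exp\left\{-k\left(\frac{c^2}{\|S\|_\delta^2} - d_n^2\right)\right\}, \tag{6.7}$$

where $d_n = \sqrt{2\ln(3) + 2\ln(n+2)}$. Setting $z := c - d_n\|S\|_\delta$ and

$$Z_m := \sqrt{n}\left(\frac{L_m(S)}{m} - \lambda_n(S) - \frac{2\|S\|_\infty}{n}\right) - d_n\|S\|_\delta,$$

(6.7) can be expressed as

$$P\left[Z_m \ge z\right] \le \exp\left\{-k \times \frac{z^2 + 2 z d_n \|S\|_\delta}{\|S\|_\delta^2}\right\}.$$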

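For $z \ge 0$, the right-hand side can be bounded by the quadratic term alone,

$$P\left[Z_m \ge z\right] \le \exp\left\{-\frac{k z^2}{\|S\|_\delta^2}\right\}, \quad \forall\, z \ge 0.$$

This yields a bound on $E[Z_m]$,

$$E[Z_m] \le \int_0^\infty P\left[Z_m \ge z\right] dz \le \int_0^\infty \exp\left\{-\frac{k z^2}{\|S\|_\delta^2}\right\} dz = \frac{\sqrt{\pi}\,\|S\|_\delta}{2\sqrt{k}},$$

and taking $k \to \infty$, we find

$$\limsup_{k \to \infty} E[Z_m] \le 0. \tag{6.8}$$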
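The Gaussian-type integral that bounds $E[Z_m]$ has the standard closed form $\sqrt{\pi}\,\|S\|_\delta / (2\sqrt{k})$, which tends to zero as $k$ grows. The sketch below compares this closed form with a simple midpoint-rule quadrature; the function names and the numerical values of $k$ and $\|S\|_\delta$ are illustrative assumptions.

```python
import math


def tail_integral_closed_form(k, norm_delta):
    """Closed form of the integral of exp(-k z^2 / norm_delta^2) over z >= 0."""
    return math.sqrt(math.pi) * norm_delta / (2 * math.sqrt(k))


def tail_integral_numeric(k, norm_delta, steps=20000):
    """Midpoint-rule approximation of the same integral on a truncated range."""
    # Beyond 10 standard widths the integrand is below exp(-100) and is ignored.
    upper = 10 * norm_delta / math.sqrt(k)
    h = upper / steps
    return h * sum(
        math.exp(-k * ((i + 0.5) * h) ** 2 / norm_delta ** 2)
        for i in range(steps)
    )


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        print(k, tail_integral_closed_form(k, 2.0), tail_integral_numeric(k, 2.0))
```

The decay at rate $1/\sqrt{k}$ is what drives the limit in (6.8).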
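Finally, we have

$$\lambda(S) = \lim_{m \to \infty} E\left[\frac{L_m(S)}{m}\right] = \lim_{m \to \infty} \left(\frac{1}{\sqrt{n}} E[Z_m] + \lambda_n(S) + \frac{2\|S\|_\infty}{n} + \frac{d_n \|S\|_\delta}{\sqrt{n}}\right) \le \lambda_n(S) + \frac{2\|S\|_\infty}{n} + \frac{c_n \|S\|_\delta \sqrt{\ln n}}{\sqrt{n}},$$

where we used $c_n \sqrt{\ln n} = d_n$. Since $\lambda_n(S) \le \lambda(S)$ by subadditivity, this proves the lemma. □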

Lemma 6.3 (McDiarmid's Inequality [14]). Let $Z_1, Z_2, \dots, Z_m$ be i.i.d. random variables that take values in a set $D$, and let $g : D^m \to \mathbb{R}$ be a function of $m$ variables with the property that

$$\max_{i=1,\dots,m} \; \sup_{z \in D^m,\ \hat z_i \in D} \left| g(z_1, \dots, z_m) - g(z_1, \dots, \hat z_i, \dots, z_m) \right| \le C.$$

Thus, changing a single argument of $g$ changes its image by at most a constant $C$. Then the following bounds hold,

$$P\left[g(Z_1, \dots, Z_m) - E[g(Z_1, \dots, Z_m)] \ge \epsilon \times m\right] \le \exp\left\{-\frac{2\epsilon^2 m}{C^2}\right\},$$

$$P\left[E[g(Z_1, \dots, Z_m)] - g(Z_1, \dots, Z_m) \ge \epsilon \times m\right] \le \exp\left\{-\frac{2\epsilon^2 m}{C^2}\right\}.$$

Proof. A consequence of the Azuma-Hoeffding Inequality, see [14]. □
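As a simple illustration of Lemma 6.3 (not an example from the paper), take $g$ to be the number of ones among $m$ fair coin flips, so that changing one flip changes $g$ by at most $C = 1$. The sketch below compares the empirical upper tail with the bound; the function names, the choice of $g$, and the simulation parameters are assumptions made for illustration.

```python
import math
import random


def mcdiarmid_bound(eps, m, C):
    """Right-hand side of the bounds in Lemma 6.3."""
    return math.exp(-2 * eps ** 2 * m / C ** 2)


def empirical_tail(eps, m, trials=2000, seed=0):
    """Empirical frequency of g - E[g] >= eps * m, with g a sum of m fair coin flips."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        g = sum(rng.randint(0, 1) for _ in range(m))
        if g - m / 2 >= eps * m:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    eps, m = 0.1, 100
    print("empirical tail:", empirical_tail(eps, m))
    print("McDiarmid bound:", mcdiarmid_bound(eps, m, C=1.0))
```

The bound is not tight in this example, but it holds uniformly over all functions $g$ with the bounded-difference property, which is what the proof of Lemma 6.2 requires.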

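Lemma 6.4. Under the notation introduced in Lemma 6.2 and its proof, it is true that for all $(\vec r, \vec s) \in \mathcal{P}_{m,n}$ the following bound applies,

$$\frac{E\left[L_m(S, \vec r, \vec s)\right]}{m} \le \lambda_n(S) + \frac{2\|S\|_\infty}{n}.$$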



Proof. Assuming first that $r_i - r_{i-1} + s_i - s_{i-1} = n$, we first note that, by the i.i.d. nature of the random variables $X_j$ and $Y_j$ and by symmetry of the scoring function $S$, the following random variables are identically distributed,

$$L_S\left(X_{r_{i-1}+1} \dots X_{r_i},\; Y_{s_{i-1}+1} \dots Y_{s_i}\right),$$

$$L_S\left(X_1 \dots X_{r_i - r_{i-1}},\; Y_1 \dots Y_{s_i - s_{i-1}}\right),$$

$$L_S\left(X_{r_i - r_{i-1}+1} \dots X_n,\; Y_{s_i - s_{i-1}+1} \dots Y_n\right).$$

Furthermore, it must be true that

$$L_S\left(X_1 \dots X_{r_i - r_{i-1}},\; Y_1 \dots Y_{s_i - s_{i-1}}\right) + L_S\left(X_{r_i - r_{i-1}+1} \dots X_n,\; Y_{s_i - s_{i-1}+1} \dots Y_n\right) \le L_S\left(X_1 \dots X_n,\; Y_1 \dots Y_n\right),$$

since any alignments of the two pairs of strings on the left-hand side can be concatenated to yield a valid alignment of the pair of strings on the right-hand side. Taking expectations, we find

$$2\,E\left[L_S\left(X_{r_{i-1}+1} \dots X_{r_i},\; Y_{s_{i-1}+1} \dots Y_{s_i}\right)\right] \le E[L_n(S)]. \tag{6.9}$$



Next, allowing $r_i - r_{i-1} + s_i - s_{i-1}$ any value in $\{n-1, n, n+1\}$, this situation is obtained from the previous case by lengthening or shortening at most one of the strings involved by at most one letter. By Lemma 6.1, such an amendment cannot change the optimal alignment score by more than $\|S\|_\infty$, so that (6.9) gives rise to the inequality

$$E\left[L_S\left(X_{r_{i-1}+1} \dots X_{r_i},\; Y_{s_{i-1}+1} \dots Y_{s_i}\right)\right] \le \frac{E[L_n(S)]}{2} + \|S\|_\infty, \tag{6.10}$$

which applies to the general situation. Taking expectations on both sides of (6.4) and substituting (6.10), we find

$$E\left[L_m(S, \vec r, \vec s)\right] \le 2k\,\frac{E[L_n(S)]}{2} + 2k\,\|S\|_\infty.$$

Division by $m$ now yields the claim of the lemma. □